FOREWORD


This  effort  was  conducted  under  Contract  N00123-78-C-1206  with  the  American 
College  Testing  (ACT)  Program  within  work  unit  ZF 522.012.03.01,  Criterion-Referenced 
Testing  (CRT).  The  objective  of  this  work  unit  is  to  develop  and  evaluate  innovative  CRT 
techniques  to  alleviate  some  of  the  deficiencies  and  problems  that  exist  with  current 
procedures used in the Navy training/testing community (e.g., item-writing methods, item
statistics,  generalizing  to  the  domain  of  performance,  and  computerized  adaptive  testing). 

The  purpose  of  the  ACT  effort  was  to  investigate  errors  of  measurement  in  criterion- 
referenced,  domain-referenced,  and  mastery  testing.  This  effort  has  been  conducted  in 
two  phases.  NPRDC  TN  80-15  (Brennan,  1980a)  reported  on  the  first  phase:  The 
development  of  a  computer  program  to  estimate  error  variances,  variance  components, 
and  indices  of  dependability.  That  technical  note  tells  testing  researchers  how  to  run  the 
program,  and  how  to  interpret  and  use  the  results  appropriately. 

This technical note reports on the second phase: The development of a handbook of
some simple statistical techniques for producing and evaluating criterion- and/or
domain-referenced tests (DRTs) for Navy technical training. It is a "how-to-do-it" handbook for
use  in  developing  and  assessing  CRTs  and/or  DRTs.  Specifically,  it  considers  item 
analysis  procedures,  techniques  for  establishing  cutting  scores,  errors  of  measurement  and 
classification,  test  length,  and  advancement  scores,  as  well  as  group-based  coefficients  of 
agreement. 

This  handbook  is  a  working  document  intended  for  limited  distribution  to  Center 
personnel  and  peers  in  the  scientific  community.  It  is  not  a  formal  presentation  of  Center 
research.  Parts  of  it  will  be  incorporated  into  a  larger,  more  comprehensive  testing 
manual  for  achievement  and  diagnostic  testing  that  is  being  produced  by 
NAVPERSRANDCEN  for  the  Navy  technical  training  community. 

The  contracting  officer's  technical  representative  was  Pat-Anthony  Federico. 

RICHARD  C.  SORENSON 
Director  of  Programs 


SUMMARY 


Problem 

Many of the statistical techniques that have been used for developing and evaluating
norm-referenced tests are not applicable to criterion-referenced tests (CRTs) and
domain-referenced tests (DRTs), since the data from these latter tests do not usually follow
the normal distribution. Further, CRTs and DRTs are not used to compare or rank
students against one another; rather, they are used to determine whether students have
met or exceeded mastery learning levels or absolute performance standards. Statistical
procedures are needed that can be easily employed by developers and evaluators of CRTs
and DRTs in the Navy.

Purpose 

The  purpose  of  this  effort  was  to  investigate  errors  of  measurement  in  criterion- 
referenced,  domain-referenced,  and  mastery  testing. 

Approach 

A handbook of some statistical techniques for producing and evaluating DRTs was
created for Navy practitioners. This is a "how-to-do-it" guide for the intelligent layman
who develops and assesses DRTs and/or CRTs. This handbook considers item analysis
procedures, techniques for establishing cutting scores, errors of measurement and
classification, test length, and advancement scores, as well as group-based coefficients of
agreement.

Results  and  Conclusions 

No  attempt  was  made  to  catalogue,  list,  or  describe  exhaustively  a  large  number  of 
available  procedures  for  a  particular  purpose.  Rather,  a  few  procedures  were  selected  for 
a  single  purpose  based  upon  the  principal  investigator's  judgment  as  to  which  are  the  best 
techniques  for  that  purpose.  Simple  numerical  examples  were  used  to  illustrate 
procedures,  and  guidelines  were  provided  for  using  and/or  interpreting  results. 

Future Direction


This  handbook  will  be  incorporated  into  a  larger,  more  comprehensive  testing  manual 
for  achievement  and  diagnostic  testing,  which  is  being  produced  by  NAVPERSRANDCEN 
for  the  Navy  technical  training  and  testing  community. 

1. Introduction


Almost twenty years ago, the term "criterion-referenced testing"
was introduced into the literature on educational measurement, and since
that time an enormous number of papers have been published that deal with
technical issues in this area. In no way does this handbook represent
an attempt to synthesize all of this literature; rather, this handbook
treats a restricted set of statistical procedures for addressing some
of the most prevalent technical issues that arise in criterion-referenced
testing, which is frequently called "domain-referenced testing."

Throughout this handbook the term "domain-referenced" will be used instead
of "criterion-referenced" for two principal reasons. First, the
term "criterion-referenced" too readily suggests some external criterion
against which examinee performance on a test can be compared. There are
situations in which an external criterion exists and relevant data are
available. However, such situations are rare in this author's experience;
and, indeed, none of the procedures discussed in this handbook requires
criterion data, in the usual sense of the word "criterion." Second,
in  this  handbook  it  is  assumed  that  the  items  in  a  test  can  be  viewed 
as  a  sample  from  a  larger  universe  of  potential  items  that  might  have 
been  chosen  for  the  test.  It  is  natural  to  refer  to  this  universe  as 
a  domain — hence,  the  term  "domain-referenced." 


In domain-referenced testing, the examinee's score of principal
interest is the examinee's score over all items in the universe of items.
This score can never be obtained directly, but it can be estimated by,
for example, the examinee's observed score on a set of items, or test.
Also, in domain-referenced testing, the interpretation of an examinee's
score is not based on the scores obtained by other examinees. In a sense,
therefore, the phrase "domain-referenced testing" is itself something
of a misnomer, because what is of principal interest is domain-referenced
interpretations of examinee scores. Such interpretations are frequently
contrasted with norm-referenced interpretations that involve comparing the
performance of an examinee relative to the performance of other examinees.

To put it another way, even highly qualified experts would have
great difficulty distinguishing between a norm-referenced and a
domain-referenced test, per se; and the procedures for administering and
scoring domain-referenced and norm-referenced tests seldom differ much at
all. The differences lie in the interpretations given to the
resulting scores, and in the procedures employed to study the quality of
these interpretations. Indeed, in principle, scores on any test can be
given either domain-referenced or norm-referenced interpretations,
although this is rarely done.

Actually, one can distinguish between two types of domain-referenced
interpretations. One interpretation rests on using an examinee's observed
score on a test as an estimate of his/her universe score. The
other interpretation involves comparing an examinee's score to some
fixed cutting score that is defined independently of examinee test
scores. This latter type of interpretation is frequently associated
with mastery/non-mastery decisions.

The procedures discussed in this handbook do not necessarily represent
the most technically sophisticated procedures available. Indeed, the
procedures discussed here have been chosen, in large part, because they
do not necessitate extensive computations, even though the theoretical
foundation for some of these procedures is highly technical. Also, no
claim is made that the procedures discussed in this handbook treat all
relevant issues in domain-referenced testing, although they do cover those
issues most frequently discussed. The general intent is simply to provide
practitioners with a unified treatment of some relatively straightforward
statistical procedures for use in domain-referenced testing.

Sample  Statistics 

In  this  handbook  all  computational  formulas  and  procedures  are 
provided  in  tables  that  include  examples  employing  synthetic  data.  In 
every case, the computations involve nothing more mathematically complicated
than computing sample means, variances, and standard deviations,
and then combining these quantities in various ways.

It is assumed here that the reader is already at least partially
familiar with the concepts of a mean, variance, and standard deviation.
In a certain statistical sense, a mean is a single number (an average
value, or a "central" value) that represents an entire set of scores,
while variance and standard deviation are convenient measures of the
amount of spread, or dispersion, in a set of scores.

Table 1.1 provides formulas for calculating the sample statistics
used in this handbook. To give the statistics in Table 1.1 a concrete
interpretation, the formulas for them are expressed with respect to a
person's mean score, or proportion of items correct (number of items
answered correctly divided by the total number of items). In this
handbook, a person's mean score is represented by $\bar{x}_p$, where the "bar"
over the variable $x$ signifies a mean score, and the subscript $p$
signifies a particular person. Specifically, if there are $n$ items and
$x_{pi}$ represents the score for person $p$ on item $i$, then the mean score
for person $p$ is $\bar{x}_p = \sum_i x_{pi}/n$, where $\sum_i$ means "the sum over items."

If one wanted to express these sample statistics in terms of a
person's total score on a test, then the symbol $x_p$ (without a bar)
would be used. Also, it should be noted that what is important is
the form of the equations in Table 1.1, not the fact that they are
expressed in terms of a variable $x$. The same form would apply if the
variable were labelled $y$, as is the case in one section of this handbook.
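
As a small illustration of the notation above, the following sketch computes a person's total score and mean score from item scores. The function name and the response data are hypothetical, and items are assumed to be scored 0 (incorrect) or 1 (correct).

```python
def mean_score(item_scores):
    """Return a person's mean score (proportion of items correct).

    item_scores holds one 0/1 score per item, so the result is the
    sum over items of x_pi divided by n.
    """
    n = len(item_scores)
    if n == 0:
        raise ValueError("at least one item score is required")
    return sum(item_scores) / n


# Hypothetical responses of one person to a 10-item test (1 = correct).
person_p = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

total_score = sum(person_p)      # x_p: number of items correct
x_bar_p = mean_score(person_p)   # x-bar_p: proportion of items correct
print(total_score, x_bar_p)
```

The same function applies unchanged if the variable is labelled y rather than x, since only the form of the computation matters.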

In Table 1.1 two formulas are provided for sample variance: one
that uses the symbol $s^2$, and one that uses the symbol $\hat{s}^2$. The latter
is easily obtained from the former, and in almost all cases these two
statistics will have very similar (but not identical) values. In certain
sections of this handbook, $s^2$ is used, and in other sections $\hat{s}^2$
is used. However, as far as this handbook is concerned, the sole reason for
choosing between $s^2$ and $\hat{s}^2$ is to provide the simplest possible
computational procedures for estimating quantities of interest. (A similar
statement holds for the corresponding standard deviations, $s$ and $\hat{s}$.)
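
Table 1.1 remains the authority for the two variance formulas. Purely as a sketch, and assuming (this is not stated in the text above) that the two formulas differ only in dividing the sum of squared deviations by n versus n - 1, the relationship between two such variances can be illustrated as follows. The scores are hypothetical, and no claim is made here about which symbol goes with which divisor.

```python
import math
import statistics

# Hypothetical mean scores (proportion correct) for six persons.
mean_scores = [0.9, 0.8, 0.7, 1.0, 0.6, 0.8]
n = len(mean_scores)

# Assumption: one variance uses divisor n, the other divisor n - 1.
var_n = statistics.pvariance(mean_scores)          # divisor n
var_n_minus_1 = statistics.variance(mean_scores)   # divisor n - 1

# Either variance is easily obtained from the other by rescaling.
rescaled = var_n * n / (n - 1)

# Corresponding standard deviations.
sd_n = math.sqrt(var_n)
sd_n_minus_1 = math.sqrt(var_n_minus_1)

print(round(var_n, 4), round(var_n_minus_1, 4))
```

With only six scores the two values differ noticeably; as the number of scores grows, the ratio n/(n - 1) approaches 1 and the two variances become very similar.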

It was mentioned above that a standard deviation is a measure of
the amount of spread or dispersion in a set of scores. To give the concept
of a standard deviation a more concrete interpretation, it is common
practice to consider the standard deviation of a particular bell-shaped
distribution of scores, called a normal distribution. As illustrated
in Figure 1.1, for a normal distribution: (a) about 68% of the scores lie
within one standard deviation to the right and left of the mean; and (b)
about 95% of the scores lie within two standard deviations to the right and
left of the mean. These two statements also can be expressed in terms
of what are called "z-scores."

As  indicated  in  Figure  1.1,  a  score  that  lies  one  standard  deviation 
above  the  mean  can  be  denoted  z  =  1;  and,  a  score  that  lies  one  standard 
deviation  below  the  mean  can  be  denoted  z  =  -1.  It  follows  that,  for 
a  normal  distribution,  68%  of  the  scores  lie  between  z  =  -1  and  z  =  1. 
Similarly,  95%  of  the  scores  lie  between  z  =  -2  and  z  =  2. 
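
The two benchmark percentages can be checked numerically. The sketch below uses the standard relationship between the normal distribution and the error function; the function names are illustrative only.

```python
import math


def z_score(score, mean, sd):
    """Express a score as a number of standard deviations from the mean."""
    return (score - mean) / sd


def normal_proportion_within(z):
    """Proportion of a normal distribution lying between -z and +z."""
    return math.erf(z / math.sqrt(2))


within_one_sd = normal_proportion_within(1)   # roughly 68%
within_two_sd = normal_proportion_within(2)   # roughly 95%
print(round(within_one_sd, 4), round(within_two_sd, 4))
```

The computed proportions are slightly above 68% and 95%, which is why these figures are best treated as rounded bench marks rather than exact values.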

The above statements about percent of cases between specified z-scores
do not apply to all possible distributions of scores. However, provided
one does not interpret such statements too literally, they can properly
serve as useful bench marks for conceptualizing the interpretation of a
standard deviation.


The  reader  is  cautioned  not  to  infer  from  the  above  paragraphs  that 
test  scores  are  usually  (or  should  be)  normally  distributed.  Indeed, 
for domain-referenced tests, it is quite common to have many high-scoring
examinees and relatively few low-scoring examinees; and such a
distribution is not normal. For this reason, most procedures treated in this
handbook involve no assumption about the shape of the score distribution.

Universe  of  Items 

A  universe  of  items  is  a  concept  of  central  importance  for  domain- 
referenced  interpretations,  because  ultimately  one  wants  to  make  inferences 
about  examinee  universe,  or  domain,  scores.  (Considerations  with  respect 
to  a  universe  of  items  are  prominent  in  some  approaches  to  norm-referenced 
interpretations, too, but norm-referenced interpretations are not within
the  scope  of  this  handbook.) 

Sometimes  there  actually  exists  a  set  of  items  that  can  be  considered 
as  the  intended  universe.  For  example,  some  computer-managed  instruction 
systems  have  a  large  bank  of  items  that  is  used  to  construct  specific 
tests.  Also,  the  words  in  a  specified  dictionary  might  constitute  a 
universe  for  a  spelling  domain. 

More frequently, however, pragmatic concerns require that one
conceptualize a universe of items for the content under consideration. For example,
in the initial stages of developing a domain-referenced testing system,
it is likely that only a limited number of items will be available.
Furthermore, for many content areas, it would be virtually impossible to
construct all relevant items, or even a large proportion of such items.
In such cases, it is especially important that the intended universe be
defined and described in as clear and unambiguous a manner as possible.
Otherwise, one cannot easily claim that a particular item does, or does
not, reference the intended domain; nor can one clearly specify what an
examinee's universe score means.

No matter how a universe may be defined, in this handbook a test
is viewed as a sample of items from an intended universe. More specifically,
to be technically correct, we ought to say that a test is a random
sample of items from the universe, in the sense that every item in the
universe has an equal chance of appearing in any test. In practice, one
seldom has the opportunity to randomly select a sample of items, in the
literal sense of the word "randomly." However, if a universe is defined
well enough, then one can usually ensure that a test consists of a
reasonably representative sample of items from the intended universe.
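
Where an actual item bank exists, the literal sense of "randomly" can be sketched in a few lines. The bank, the item identifiers, and the function name below are hypothetical; the sketch only illustrates equal-chance selection without replacement.

```python
import random


def draw_test(item_bank, n_items, seed=None):
    """Draw n_items from item_bank without replacement.

    Every item in the bank has the same chance of being selected,
    which is the literal sense of a random sample of items.
    """
    rng = random.Random(seed)
    return rng.sample(item_bank, n_items)


# Hypothetical bank of 200 item identifiers.
item_bank = [f"item-{i:03d}" for i in range(1, 201)]

# A 10-item test drawn at random; the seed only makes the draw repeatable.
test_form = draw_test(item_bank, 10, seed=1)
print(test_form)
```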

It can be argued that for every objective in a program or instructional
sequence, there ought to be a distinct universe of items. It is
not uncommon, however, for a test to reference a universe that might be
viewed as stratified, in the sense that the universe is defined by multiple
objectives or the multiple categories in a table of specifications or
task-content matrix. The procedures discussed in this handbook do not
specifically incorporate considerations with respect to a universe defined
in this manner, even though these procedures (or similar ones) are
sometimes used with such universes.

Overview

No  matter  how  well-defined  a  universe  of  items  may  be,  the  quality 
of  the  decisions  made  can  be  no  higher  than  the  quality  of  the  items 
themselves.  Therefore,  Section  2  considers  some  simple  item  analysis 
procedures  for  using  data  to  help  identify  items  that  may  be  flawed.  This 
topic is rather mundane, and the process of performing item analyses is
tedious; but, in this author's opinion, the validity of a domain-referenced
measurement procedure absolutely necessitates using good items that
represent a well-defined universe of items. Furthermore, no after-the-fact
statistical  analysis  of  examinee  test  scores  can  overcome  the  negative 
impact  of  poor  items  on  the  quality  of  domain-referenced  interpretations. 

Section 3 considers a rather simple procedure for establishing a
cutting score, $\pi_o$, expressed as a proportion of items correct for the
universe of items. (In this handbook the Greek letter $\pi$ is used to
represent a score for the universe of items, whereas $x$ is used for a score on
a test, or sample of items from the universe.) This procedure is
"content-based" in the sense that it relies upon the subjective (but, hopefully,
well-informed) judgments of content-matter specialists.

Section 4 treats a procedure for establishing an advancement score.
Recall that a cutting score, $\pi_o$, is expressed as a proportion of items
correct for the universe of items; and, as such, $\pi_o$ is "similar" to an
examinee's universe score, $\pi$, in the sense that both $\pi_o$ and $\pi$ reference
the same universe of items. By contrast, an advancement score, $x_o$, is
"similar" to an examinee's observed score, $x$, in the sense that both
reference  a  test  score.  To  put  it  another  way,  an  advancement  score  is 
an  observed  score  analogue  of  a  cutting  score,  just  as  an  examinee's  test 
score  is  an  observed  score  analogue  of  his/her  universe  score.  A  decision 
concerning  mastery  is  actually  made  with  respect  to  the  advancement  score; 
i.e.,  an  examinee  is  declared  a  master  if  his/her  observed  score  is  at 
or  above  the  advancement  score. 
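
The decision rule just stated can be written as a one-line comparison. In this sketch the function name and the advancement score of 8 items correct on a 10-item test are hypothetical; Section 4 is the place to look for how an advancement score is actually established.

```python
def is_master(observed_score, advancement_score):
    """Declare mastery when the observed score is at or above the advancement score."""
    return observed_score >= advancement_score


# Hypothetical advancement score: 8 items correct on a 10-item test.
advancement_score = 8

decisions = {score: is_master(score, advancement_score) for score in (7, 8, 9)}
print(decisions)
```

Note that an examinee scoring exactly at the advancement score is declared a master, because the rule is "at or above."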

Section 5 considers two types of error that can be made when a
decision about an examinee is based on the examinee's observed score
rather than his/her universe score (which is never known). These two
types of error are called error of measurement and error of classification.
Error of measurement involves the extent to which examinee observed
and universe scores differ; and, as such, error of measurement
does not involve consideration of a cutting score. By contrast, an
error  of  classification  is  made  if  an  examinee  is  erroneously  classified 
as  a  master  or  erroneously  classified  as  a  non-master. 
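
The distinction between the two types of error can be sketched with invented numbers. Because a universe score is never known in practice, the values below are purely hypothetical. The sketch also assumes that a "true" master is an examinee whose universe score is at or above the cutting score, and that error of measurement is taken as observed minus universe score; neither convention is stated in the overview above.

```python
def error_of_measurement(observed, universe):
    """Difference between observed and universe scores (no cutting score involved)."""
    return observed - universe


def is_misclassified(observed, universe, advancement_score, cutting_score):
    """True when the observed-score decision disagrees with true status.

    Assumption: an examinee is a true master when the universe score
    is at or above the cutting score.
    """
    declared_master = observed >= advancement_score
    true_master = universe >= cutting_score
    return declared_master != true_master


# Hypothetical examinee: universe score .78, observed score .90,
# with cutting and advancement scores both set at .80.
measurement_error = error_of_measurement(0.90, 0.78)
misclassified = is_misclassified(0.90, 0.78, 0.80, 0.80)
print(round(measurement_error, 2), misclassified)
```

In this invented case the examinee would be declared a master on the basis of the observed score even though the universe score falls below the cutting score, which is an erroneous classification as a master.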

Section 6 considers a number of issues associated with assessing
the quality of domain-referenced measurement procedures for a group of
examinees. These issues are, in part, related to traditional notions
of reliability (or measurement consistency). Also, to an extent, these
issues have a validity connotation, because in domain-referenced testing,
examinee universe scores are a principal "criterion" of interest.
However, the terms "reliability" and "validity" are used only infrequently
in Section 6 because they too easily connote traditional statistical
analyses (for norm-referenced interpretations) that are inappropriate
in domain-referenced measurement contexts. Rather, emphasis is placed
upon certain agreement coefficients and group-based measures of error.

Restrictions  in  Scope  and  Content 

Domain-referenced  measurement  is  currently  a  topic  of  considerable 
interest  in  numerous  applied  settings,  and  a  handbook  such  as  this 
cannot  treat  all  relevant  issues  in  all  such  settings.  In  particular, 
there  are  many  important  educational,  philosophical,  legal,  ethical, 
and  technical  issues  involved  in  testing  for  licensure,  certification, 
"minimal"  competency,  etc.  For  the  most  part,  such  issues  are  not  treated 
here;  rather  emphasis  is  placed  upon  procedures  that  seem  to  this  author 
to  be  both  theoretically  reasonable  and  capable  of  being  used  relatively 
easily  by  practitioners — especially  practitioners  in  instructional  and 
training  environments  where  nothing  more  sophisticated  than  a  simple 
hand-held  calculator  may  be  available. 

Throughout  this  handbook  it  is  assumed  that  examinee  responses  are 
not  corrected  for  guessing.  In  several  cases,  the  procedures  discussed 
could  be  (or  have  been)  modified  in  various  ways  to  take  guessing  into 
account.  Such  modifications  are  not  treated  here  for  three  reasons. 

First,  many  such  modifications  make  assumptions  about  guessing  that  the 
author  believes  are  unrealistic.  Second,  reasonable  assumptions  about 
guessing  involve  complexities  considerably  beyond  the  scope  of  this 
handbook. Third, it remains to be seen (in a research sense) whether
or not procedures involving reasonable assumptions about guessing materially
improve the quality of decisions made in typical domain-referenced
testing situations.


In the field of statistics, distinctions are carefully drawn between
quantities of principal interest, called parameters, and estimates of
these quantities, called statistics. For theoretical work, this distinction
is crucial, but to incorporate this distinction in the body of this
handbook would necessitate a much more complicated notational system,
as well as considerably more complex verbal statements. Therefore, the
term "statistic" is used in this handbook in a generic sense (even though
occasionally the word "parameter" would be better, technically), and there
is no notational distinction drawn between parameters and estimates.

Also,  both  quantities  of  principal  interest  and  their  estimates  are 
usually  denoted  with  Greek  letters  to  distinguish  them  from  the  sample 
statistics  discussed  in  conjunction  with  Table  1.1.  Finally,  concerning 
notational  conventions,  sometimes  a  symbol  is  underlined  in  the  text  for 
emphasis  and/or  to  preclude  mistaking  it  for  part  of  a  word  or  phrase. 

The  body  of  this  handbook  does  not  contain  references  to  published 
work, proofs of formulas and equations, or justifications for choosing the
procedures  treated  here  rather  than  others  which  might  have  been  chosen. 
However,  to  a  limited  extent,  these  issues  are  treated  in  Appendix  B, 
which  is  provided  principally  for  the  technically  oriented  reader.  It 
will  be  evident  to  such  a  reader  that,  in  several  cases,  the  treatments 
of  procedures  in  the  body  of  the  handbook  are  slight  modifications  of 
procedures  discussed  in  published  literature.  Such  modifications  were 
made  principally  for  computational  convenience.  Furthermore,  in  a  few 
instances  procedures  are  presented,  or  suggestions  are  made,  that  have 
not  been  considered  previously  in  published  literature. 

2. Item Analysis Considerations

In domain-referenced testing (or any type of testing, for that
matter) there is no substitute for good items. No statistical procedure
can overcome the negative effect of poor test items; but as discussed
in this section, statistics can be used to help identify poor items.

First, however, it must be emphasized that, prior to collecting any
data, every effort must be made to ensure that items reflect the objectives
they are intended to measure and that the items have no obvious
technical flaws. Such judgments are best made by content-matter specialists
who have knowledge of item construction procedures and guidelines.
If content-matter specialists do not have such knowledge, then they
should be aided in their judgments by someone who does. Also, items
should be reviewed for potential bias by members of minority groups,
especially when domain-referenced tests are to be used with members
of minority groups.

Item  Analysis  Table  and  Statistics 

No  matter  how  thoroughly  content  matter  experts  scrutinize  items 
to  eliminate  flaws,  it  is  always  advisable  to  study  examinee  responses 
to  items.  Such  data  provide  an  additional  check  on  item  quality.  Usually 
such  data  are  displayed  in  the  form  of  an  item  analysis  table  such  as 
that  provided  in  Table  2.1. 

To give a context to the synthetic data in Table 2.1, let us assume
that 10 items were administered to 50 examinees, and one of these items
resulted in the data in Table 2.1.

Table 2.1

Illustration of an Item Analysis Table and Statistics
Using Synthetic Data

                          Subgroup
                 Low     Medium    High
Alternative     (0-6)    (7-8)    (9-10)   Total      p       B

a                 3        1         2        6      .14    -.13
b*                8        9        16       33      .75     .18
c                 2        1         1        4      .09    -.10
d                 0        0         0        0      .00     .00
Omit              0        0         1        1      .02     .05
Not Reached       3        3         0        6       -       -

Total            16       14        20       50
Total minus
Not Reached      13       11        20       44       -       -

(2.1)  p = proportion of examinees who chose the alternative
           (or omitted the item)

(2.2)  B = [proportion of examinees in the high group who chose the
            alternative (or omitted the item)]
           - [proportion of examinees in the low group who chose the
              alternative (or omitted the item)]

e.g.  For the correct alternative, b,
      p = 33/44 = .75
      B = (16/20) - (8/13) = .80 - .62 = .18

Numbers within parentheses indicate the scores (in terms of number
of items correct) that fall into each group.

Note. * indicates the correct (keyed) alternative.

Table 2.1 indicates that this item
contains  four  alternatives  with  the  correct  (or  keyed)  alternative  being 
b (the alternative that is starred). Note that the other alternatives
(namely a, c, and d) are sometimes called distractors, or incorrect
alternatives. 

To  study  examinee  performance  on  an  item,  it  is  usual  to  classify 
the  examinees  into  groups  based  on  their  test  performance.  In  Table 
2.1  this  has  been  accomplished  by  assigning  each  examinee  to:  (a)  a 
"low"  group  if  he/she  has  0-6  items  correct;  (b)  a  "medium"  group  if 
he/she  has  7-8  items  correct;  or  (c)  a  "high"  group  if  he/she  has 
9-10  items  correct.  For  present  purposes,  the  reader  can  assume  that 
examinees  in  the  high  group  would  be  judged  "successful,"  those  in  the 
low  group  would  be  judged  "unsuccessful,"  and  those  in  the  middle  group 
might  (or  might  not)  be  judged  "successful." 
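
The grouping used in Table 2.1 can be sketched as a small function. The cut points (0-6, 7-8, 9-10) are those of Table 2.1; the function name and the list of examinee scores are hypothetical.

```python
from collections import Counter


def assign_group(n_correct, low_max=6, medium_max=8, n_items=10):
    """Assign an examinee to a subgroup from the number of items correct."""
    if not 0 <= n_correct <= n_items:
        raise ValueError("number correct must be between 0 and n_items")
    if n_correct <= low_max:
        return "low"
    if n_correct <= medium_max:
        return "medium"
    return "high"


# Hypothetical number-correct scores for ten examinees on a 10-item test.
scores = [10, 9, 7, 6, 8, 3, 9, 10, 5, 7]
group_sizes = Counter(assign_group(score) for score in scores)
print(group_sizes)
```

Other cut points can be supplied through `low_max` and `medium_max` when a different definition of the three groups is more appropriate for a given test.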

The  entries  under  the  columns  headed  low,  medium,  and  high  are  the 
numbers  of  examinees  in  each  group  who  chose  each  alternative,  omitted 
the  item,  or  did  not  reach  the  item.  The  following  procedure  can  be  used 
to  distinguish  between  an  item  that  was  omitted  (but  attempted)  by  an 
examinee and one that was not reached (and unattempted): (a) if an
examinee omitted the last item, assume that the examinee did not reach
one item; (b) if the examinee omitted both of the last two items, assume
that two items were not reached by the examinee; (c) if the examinee
omitted all three of the last three items, assume that three items were
not reached; etc. All other blank responses by an examinee can be treated
as "omits."
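
The rule above amounts to treating the unbroken run of blanks at the end of a test as "not reached" and every other blank as an "omit." A minimal sketch follows; representing a blank response as `None`, and the function name, are assumptions made for illustration.

```python
def classify_blanks(responses):
    """Label each blank response as 'omit' or 'not reached'.

    responses is one examinee's answers in item order, with None
    marking a blank. Blanks in the unbroken run at the end of the
    test are 'not reached'; every other blank is an 'omit'.
    """
    # Find the index of the first item in the trailing run of blanks.
    first_not_reached = len(responses)
    while first_not_reached > 0 and responses[first_not_reached - 1] is None:
        first_not_reached -= 1

    labels = []
    for position, answer in enumerate(responses):
        if answer is not None:
            labels.append(answer)
        elif position >= first_not_reached:
            labels.append("not reached")
        else:
            labels.append("omit")
    return labels


# Hypothetical examinee: blanks on items 2 and 5, and on the last two items.
example = ["a", None, "c", "b", None, "d", "b", "a", None, None]
print(classify_blanks(example))
```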

Table  2.1  also  includes  column  totals  indicating  the  total  number  of 
examinees  in  each  group,  and  the  number  of  examinees  in  each  group  who 
reached  the  item.  The  row  totals  in  Table  2.1  indicate  the  total  number  of 
examinees  who  picked  each  alternative,  omitted  the  item,  or  did  not  reach 
the item. Finally, for each alternative, Table 2.1 provides two statistics,
which are identified as p and B and defined in Equations 2.1 and 2.2,
respectively. The statistic p will always have a value between 0 and 1, and B
will always be between -1 and +1.

The  statistic  p  indicates  the  proportion  of  examinees  who  chose  an 
alternative.  For  the  correct  alternative  p  is  called  the  item  difficulty 
level,  and  it  is  the  proportion  of  examinees  who  got  the  item  correct.  In 
Table 2.1, p = .75 for the correct alternative. Note that easy items have
high  difficulty  levels  and  hard  items  have  low  difficulty  levels. 

The statistic B indicates the difference between the proportions of
examinees  in  the  high  and  low  groups  who  chose  an  alternative.  For  the 
correct  alternative,  B  is  called  an  item  discrimination  index.  It  reflects 
the  difference  between  the  proportion  of  examinees  in  the  high  group  who 
got  the  item  correct  and  the  proportion  in  the  low  group  who  got  the  item 
correct. 
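
The p and B values in Table 2.1 can be reproduced from the counts in the table. The sketch below follows the worked example in Table 2.1 (p = 33/44 and B = 16/20 - 8/13), in which proportions are based on the examinees who reached the item; the variable and function names are illustrative only.

```python
# Counts from Table 2.1: (low, medium, high) examinees giving each response.
counts = {
    "a": (3, 1, 2),
    "b": (8, 9, 16),     # keyed (correct) alternative
    "c": (2, 1, 1),
    "d": (0, 0, 0),
    "omit": (0, 0, 1),
}
not_reached = (3, 3, 0)
group_totals = (16, 14, 20)

# Examinees who reached the item in each group ("Total minus Not Reached").
reached = tuple(total - nr for total, nr in zip(group_totals, not_reached))


def p_statistic(row):
    """Equation 2.1: proportion of examinees reaching the item who gave this response."""
    return sum(row) / sum(reached)


def b_statistic(row):
    """Equation 2.2: high-group proportion minus low-group proportion."""
    low, _medium, high = row
    return high / reached[2] - low / reached[0]


for label, row in counts.items():
    print(label, round(p_statistic(row), 2), round(b_statistic(row), 2))
```

For the keyed alternative b, the two functions give the item difficulty level and the item discrimination index, respectively.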

Using  Item  Analysis  Data 

The  principal  use  of  item  analysis  data  in  domain-referenced  testing 
situations is to detect flawed items. It must be understood, however, that
such data, no matter how carefully analyzed, do not provide an absolute
indication that an item is or is not flawed. Also, if an item is flawed,
the data cannot tell the investigator exactly how to correct the flaw.
What the data can do is flag a potentially flawed item and usually suggest
the nature of the problem and/or the part of the item that is flawed.
Given this perspective, the following paragraphs provide some guidelines
for examining item analysis data.

(a)  Have  an  actual  copy  of  the  item  available  when  examining  an 
item  analysis  table  like  that  in  Table  2.1. 

(b) Look at p for the correct alternative. The item may be flawed
if the item difficulty level, p, is considerably out of line with a value
one might expect. (Usually, in domain-referenced testing, items have
relatively high difficulty levels if they are obtained for a group of
examinees who have experienced instruction in the content tested.)

(c) Look at the relationship between item difficulty level and the
p values for the distractors. If a distractor has a value for p that
is above the item difficulty level, then examine the distractor to see if
in fact it could be considered, reasonably, as a correct answer. If so,
one of three problems probably exists: the correct answer was mis-specified,
the item has two or more correct answers, or the item is ambiguous. In any
case, the item requires revision.

(d) If p is very small for any distractor (e.g., alternative d in
Table 2.1), consider eliminating it or replacing it with some other
incorrect alternative, provided doing so does not change the intended
nature of the item. (Recall that if an item is inherently easy, it is very
likely that one or more distractors will be chosen infrequently.)

(e) Look at the item discrimination index (the value of B for the
correct alternative). It is very unlikely that a good item would have
a value for B that is noticeably negative, because that would mean that
a greater proportion of the low-scoring group got the item correct than
the high-scoring group. Therefore, if B is noticeably negative (say,
less than -.20), examine the item carefully, checking especially to see
that the item was scored correctly, that it is unambiguous, and that
the indicated correct answer is indeed correct.

(f) Look at the values of B for the distractors. If any of them
are noticeably positive (say, above .20), check the item to see if it
is ambiguous, or if the distractor could possibly be a correct answer.

(g)  If  either  p  or  B  for  "omits"  is  noticeably  positive,  examine  the 
item  for  ambiguities.  It  is  assumed,  here,  that  examinees  are  not  being 
penalized  for  guessing  and,  therefore,  there  is  no  extrinsic  motivation 
for  an  examinee  not  to  pick  an  alternative. 

(h) Consider the number of examinees (especially high-scoring
examinees) who did not reach the item. If many examinees did not reach
it (e.g., see Table 2.1), the item may be all right, but it is likely that
examinees were not allowed enough time when they were tested. Unless
a domain-referenced test is intended to be speeded, examinees should
have a reasonable amount of testing time. Otherwise, the examinees'
scores will not adequately reflect their ability.

The above suggestions should be regarded as reasonable "rules-of-thumb,"
not dogmatic directives. No such rules, and no amount of item
analysis data, absolve item developers and investigators from employing
common sense and good judgment based on experience and content-matter
knowledge.
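
Some of the numerical checks in guidelines (c) through (g) can be
sketched as a small screening function. This is a rough aid, not a
substitute for reading the item. The function name and data layout are
hypothetical, the .05 cut-off for a "very small" p is an assumption (the
guidelines give no number), and guidelines (b) and (h), as well as p for
omits, are left to judgment.

```python
def flag_item(stats, correct, low_p=0.05, b_limit=0.20):
    """Return rule-of-thumb warnings for one item.

    stats: {category: {"p": ..., "B": ...}}; categories other than "omit"
        and "not_reached" are treated as alternatives (assumed layout).
    correct: the keyed alternative.
    low_p: assumed cut-off for a "very small" distractor p.
    b_limit: the .20 threshold suggested in guidelines (e) and (f).
    """
    flags = []
    difficulty = stats[correct]["p"]
    if stats[correct]["B"] < -b_limit:
        flags.append("(e) discrimination index is noticeably negative")
    for category, s in stats.items():
        if category in (correct, "omit", "not_reached"):
            continue
        if s["p"] > difficulty:
            flags.append(f"(c) distractor {category} chosen more often than the key")
        if s["p"] < low_p:
            flags.append(f"(d) distractor {category} rarely chosen")
        if s["B"] > b_limit:
            flags.append(f"(f) distractor {category} has noticeably positive B")
    omit = stats.get("omit")
    if omit is not None and omit["B"] > b_limit:
        flags.append("(g) omits have noticeably positive B")
    return flags
```

A flag only marks an item for inspection; whether the item is actually
flawed remains a matter of judgment.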

Other  Considerations 

In norm-referenced testing contexts it is not uncommon for items
to be discarded or revised if the value of a discrimination index is
positive but small. This criterion should not be used in domain-referenced
testing contexts. Indeed, frequently in such contexts many
good items are virtually guaranteed to have positive but small values
for a discrimination index. Also, in norm-referenced testing contexts
a high discrimination index is frequently viewed almost as an indicator
of an ideal item. This perspective should not be taken in domain-referenced
testing contexts, at least not in the sense that highly discriminating
items are preferred over moderately discriminating ones. In
domain-referenced testing situations, emphasis is placed upon content, and
discrimination indices should be used solely as an aid in identifying
flawed items, not as a basis for classifying items into degrees of quality.

In an ideal world, all items in the universe would undergo item
analysis before any decisions were made about examinees based on any
items in the universe. This ideal is seldom feasible in practice.
Even so, no item should be used as a basis for making decisions
about examinees until it has been subjected to an item analysis. To
address this issue the following procedure can be used. First, in the
initial stages of developing a universe of items, prior to using the
items for decision-making, a reasonably large sample of them should
undergo item analysis using a representative group of examinees. Items
that do not successfully clear this hurdle should be discarded or revised.
Second, to gather item analysis data on other available items, or items
subsequently developed, one can include a small number of them in
operational versions of domain-referenced tests. However, examinee scores
on any such additional item should not be used as part of the examinee
total scores for decision-making, at least not until the item analysis
data have been studied to verify that the item has no obvious flaws.

If the above approach is taken of including new items with old items
in a domain-referenced test, then it is important that the investigator
not confuse the total number of "scored items" (those not undergoing item
analysis) and the total number of items physically in the test. Elsewhere
in this handbook, when test length, n, is discussed it is always
assumed that n is the total number of items excluding those (if any)
undergoing item analysis.

As discussed above, conducting an item analysis usually involves
classifying examinees into groups based on total test score. If new
items are included with old items, then total test score should be based
on the old items only. Of course, in the initial stages of constructing
a universe, or pool of items, total test score will have to be based
on new items only. In either case, the investigator must choose a range
of scores associated with each group. Seldom can this decision be made
in a completely unambiguous manner, because a firm basis for this decision
would necessitate information that is seldom available at the time
the decision needs to be made. For example, in initial stages of universe
construction, a cutting score may not have been firmly established.
Furthermore, as will be discussed later, even under the best of
circumstances, it is impossible to assign examinees to groups in a manner
that is guaranteed to be completely devoid of error. Even so, for
item analysis purposes a firm basis for assigning examinees to groups
is not absolutely necessary; good informed judgment based on experience
is generally sufficient.

The above discussion of item analysis procedures has been couched
in terms of multiple-choice items. For free-response items the procedure
and guidelines are essentially the same. The principal differences are
that: (a) a free-response item can be viewed as an item with two
alternatives, correct and incorrect; and (b) the investigator needs to study
all examinee responses to make sure that all correct responses have been
identified.


3.  Establishing  a  Cutting  Score 

One of the initial tasks typically encountered by an investigator
in a domain-referenced testing environment is to establish a cutting
score, π_o, expressed as a proportion of items correct for the universe
of items. Of course, π_o is not required if mastery type decisions are
not going to be made and interest is restricted to estimating an
examinee's universe score. However, in most domain-referenced testing
situations, mastery type decisions are made and, consequently, a cutting
score is required.

On rare occasions there is a known relationship between examinee
performance on the universe of items (or a large part of the universe)
and some external criterion such as on-the-job performance or performance
in some subsequent level of instruction. Such data are indeed
rare, however, because they are usually very difficult to obtain. For
example, if some measure of on-the-job performance is viewed as a
criterion, then one would have to take the following steps to obtain the
data required to use such performance as a basis for establishing a
cutting score: (a) test a representative group of examinees using a
large number of items from the universe; (b) allow all these examinees,
including those with low scores, to undertake the job under consideration;
and (c) evaluate the performance of each of these examinees on the
job. Three problems are usually encountered in attempting to carry out
these steps. First, these steps are usually time-consuming and expensive.
Second, it is frequently considered undesirable (and sometimes
ethically unacceptable) to allow low-scoring examinees to undertake
the job in question. And third, usually the evaluation of
on-the-job performance is both difficult and subject to considerable
error.

For these reasons, among others, external criteria are seldom used
(at least directly) in the process of establishing a cutting score for
domain-referenced testing purposes. Rather, it is common for a cutting
score to be defined based upon the judgments of raters, judges, or experts
who are content matter specialists. Of course, such judgments are likely
(indeed hopefully) to be influenced by raters' knowledge about potential
external criteria and about how persons generally perform on such
criteria. However, such information is not usually quantified directly.
Rather, several procedures exist for eliciting from raters their beliefs
about how minimally competent persons would perform on the universe of
items, the argument being that such judgments provide a basis for
establishing a cutting score π_o that separates mastery (or probably
acceptable performance) from non-mastery (or probably unacceptable
performance).

Procedure 


In one procedure for establishing a cutting score, each of a set of
raters, judges, or content matter specialists is asked to provide an
independent assessment of the probability that a minimally competent
examinee would get each item correct. The average probability over raters
and items (called ȳ below) is frequently used as the cutting score π_o,
and various statistics can be calculated to assess how variable this
average probability would be if the study were replicated a large number
of times. Knowledge about such variability is important in revealing
the extent to which raters agree in their judgments about what
cutting scores should actually be established.

Using this procedure, data are collected in the following
manner:

(a) A group of t raters, and a sample of m items from the universe,
are identified, where t and m are as large as time and other constraints
will allow;

(b) Each rater is told to provide, for each item, a probability
reflecting that rater's belief about the likelihood that a minimally
competent examinee would get that item correct;

(c) Items are presented to each rater in a random order, the
important point being that the items are ordered differently for each
rater;

(d) Each rater works independently of every other rater (i.e.,
raters do not discuss their judgments with each other); and

(e) Raters are told to report their probabilities in units of
1/10 (i.e., the probabilities that might be assigned are 0.0, 0.1,
0.2, ..., 1.0).

Table 3.1 reports a set of data that might result from such a study
with t = 5 raters and m = 20 items. These numbers are relatively small
solely for the purpose of simplifying subsequent illustration of
computations. An entry in the body of Table 3.1 is denoted y_ri, the
probability assigned by rater r to item i. (The symbol y is used here
to distinguish these probabilities from examinee scores on a test, which
are later denoted with the symbol x.) Along with the probabilities,
Table 3.1 reports means, variances, and standard deviations. For example,

(a) an entry in the row labeled ȳ_r is the mean probability assigned
to items by rater r, and s(ȳ_r) = .083 is the standard deviation (across
raters) of these rater mean probabilities;

(b) an entry in the column headed ȳ_i is the mean probability assigned
to item i, and s(ȳ_i) = .086 is the standard deviation (across items) of
these item mean probabilities;

(c) an entry in the row labeled s(y_r) is the standard deviation
of the probabilities assigned to items by rater r; and

(d) ȳ = .80 is the mean probability over all 20 items and all
5 raters.

In a cutting score study, interest is usually focused principally
on ȳ_r and ȳ. We may call ȳ_r the "cutting score assigned by rater r"
because it reflects that rater's belief about the proportion of items
that a minimally competent examinee would get correct. Similarly,
we may call ȳ the "study cutting score," and as such it is, in a
certain statistical sense, the best value to choose for π_o.

It is evident from the values of ȳ_r in Table 3.1 that Raters 1, 3,
and 4 are in reasonably close agreement concerning choice of a cutting
score, but Rater 2 thinks the cutting score should be higher than 0.80
and Rater 5 thinks it should be considerably lower than 0.80. This
disagreement among raters is reflected in the quantity s(ȳ_r) = .083.
Such disagreement is not unusual and probably should be expected because
even well-qualified raters may have different opinions about minimal
competence and/or the relationships between minimal competence and the
items used in the study. Indeed, one purpose of a cutting score study
is to reveal such differences of opinion in a systematic and objective
manner.
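
The summary quantities described above (rater means, item means, the
study cutting score, and the standard deviation of the rater means) can
be computed with a short Python sketch. Table 3.1 is not reproduced
here, so the ratings below are hypothetical, as are the function name
and the choice of n - 1 in the denominator of the standard deviation.

```python
from statistics import mean, stdev


def summarize_ratings(ratings):
    """Summarize a raters-by-items table of judged probabilities.

    ratings: list of rows, one row per rater, one probability per item.
    Returns the rater means, item means, grand mean (study cutting
    score), and standard deviation of the rater means (n - 1 in the
    denominator, an assumption).
    """
    rater_means = [mean(row) for row in ratings]
    item_means = [mean(column) for column in zip(*ratings)]
    # For a complete table the mean of rater means is the overall mean.
    grand_mean = mean(rater_means)
    return {
        "rater_means": rater_means,
        "item_means": item_means,
        "grand_mean": grand_mean,
        "sd_rater_means": stdev(rater_means),
    }


# Hypothetical ratings: 3 raters by 4 items, reported in units of 1/10.
hypothetical = [
    [0.8, 0.9, 0.7, 0.8],
    [0.9, 1.0, 0.8, 0.9],
    [0.6, 0.7, 0.6, 0.5],
]
summary = summarize_ratings(hypothetical)
```

In this made-up example the third rater plays the role of an atypical
rater: the rater means are .80, .90, and .60, and the study cutting
score is about .77.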

Variability  in  Study  Cutting  Scores 

For the purpose of examining variability in ȳ, s(ȳ_r) is relevant
but not actually the quantity of principal interest. Rather,
one would ideally like to know how variable ȳ would be if the study were
replicated (under similar conditions) a large number of times. Let us
describe this variability in ȳ in terms of a standard deviation and
identify it as σ(ȳ). Clearly, if σ(ȳ) were small, then, even if raters
disagreed to some extent concerning the cutting score resulting from a
single study, such disagreement would not seriously impact one's
confidence in using ȳ as a cutting score. However, if σ(ȳ) were large, then
one might want to keep this fact in mind when making decisions based
on ȳ.

Even though there is typically only one cutting score study available,
it is still possible to estimate the standard deviation of ȳ that
would result if the study were replicated a large number of times.
Table 3.2 reports three such estimates along with their numerical values
for the data in Table 3.1. These three estimates are similar in that
each of them assumes that each (hypothetical) replicated study involves
a different sample of t raters (t = 5 in Table 3.1) and a different sample
of items. As described below, the three estimates differ with respect
to the number of items involved in each replicated study. Equation 3.1
in Table 3.2 would be appropriate if an investigator wanted to consider
(hypothetical) replicated studies involving m items, the same number of
items used in the actual cutting score study. Under this circumstance,
Table 3.2 shows that σ(ȳ) = .041 for the data in Table 3.1. If, however,
an investigator wanted σ(ȳ) over replicated studies involving n items,
a number different from (usually smaller than) m, then the appropriate
estimate would be obtained from Equation 3.2 in Table 3.2. For example,
given the synthetic data and a test length of n = 10 items, Table 3.2
shows that σ(ȳ) = .045.

A third estimate of σ(ȳ) is obtained by assuming that replicated
studies would each involve rating all items in the universe. Under
this circumstance, the appropriate estimate of σ(ȳ) is Equation 3.3
in Table 3.2; and for the synthetic data σ(ȳ) = 0.036. This value
is less than either of the other two estimates of σ(ȳ) because σ(ȳ)
decreases as the number of items increases.

(Equation 3.3 in Table 3.2: standard deviation of ȳ if each of the
t raters rated all items in the universe.)
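
Table 3.2 is not reproduced in this section, so the following Python
sketch should be read only as one plausible way to obtain estimates of
this kind. It uses the usual random-effects variance component estimates
for a complete raters-by-items table; whether these agree exactly with
Equations 3.1 to 3.3 is an assumption. The function name and the
ratings are hypothetical.

```python
from math import sqrt


def sd_of_study_cutting_score(ratings, n_items=None):
    """Estimate the standard deviation of the study cutting score.

    ratings: complete raters-by-items table (list of rows of probabilities).
    n_items: number of items in each hypothetical replication; None means
        every rater rates all items in the universe.
    Negative variance component estimates are set to zero.
    """
    t = len(ratings)
    m = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (t * m)
    rater_means = [sum(row) / m for row in ratings]
    item_means = [sum(col) / t for col in zip(*ratings)]

    # Mean squares for raters, items, and the rater-by-item residual.
    ms_r = m * sum((x - grand) ** 2 for x in rater_means) / (t - 1)
    ms_i = t * sum((x - grand) ** 2 for x in item_means) / (m - 1)
    ss_ri = sum(
        (ratings[r][i] - rater_means[r] - item_means[i] + grand) ** 2
        for r in range(t)
        for i in range(m)
    )
    ms_ri = ss_ri / ((t - 1) * (m - 1))

    # Estimated variance components.
    var_r = max((ms_r - ms_ri) / m, 0.0)
    var_i = max((ms_i - ms_ri) / t, 0.0)
    var_ri = ms_ri

    variance = var_r / t  # rater contribution, present in every case
    if n_items is not None:
        variance += var_i / n_items + var_ri / (t * n_items)
    return sqrt(variance)


# Hypothetical ratings: 3 raters by 4 items.
hypothetical = [
    [0.8, 0.9, 0.7, 0.8],
    [0.9, 1.0, 0.8, 0.9],
    [0.6, 0.7, 0.6, 0.5],
]
sd_same_m = sd_of_study_cutting_score(hypothetical, n_items=4)
sd_fewer = sd_of_study_cutting_score(hypothetical, n_items=2)
sd_universe = sd_of_study_cutting_score(hypothetical)
```

As in the text, the estimate for replications that use all items in the
universe is the smallest of the three, and the estimate grows as the
number of items per replication shrinks.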


Any one of these estimates might be of interest to an investigator;
however, the third estimate is especially relevant for many (if not most)
domain-referenced testing situations. Recall that a cutting score is
defined as a proportion of items correct for the universe of items.
It follows that ideally one would like to have each rater rate every
item in the universe to obtain each of the "rater cutting scores." It
is almost always impossible to obtain such data directly, but even so
Equation 3.3 allows us to estimate σ(ȳ) under this circumstance. This
equation is also appropriate if the rating procedure is followed for all
items that occur in each and every form of a domain-referenced test.

One particular use of σ(ȳ) in Equation 3.3 is in establishing a
confidence interval for the cutting score. For example, if one goes
one standard deviation to the right and left of ȳ, then one obtains
a 68% confidence interval for the cutting score π_o. For the synthetic
data this interval extends from

    ȳ - σ(ȳ) = .800 - .036 = .76

to

    ȳ + σ(ȳ) = .800 + .036 = .84,

and this interval is represented (.76, .84). In words, we can say
that if the cutting score study were replicated a large number of times
(each time using all items in the universe), about 68% of the time we
would expect to obtain values of ȳ between .76 and .84.
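
The interval arithmetic above is simple enough to check by hand, but a
small Python helper makes it easy to repeat for other values. The
function name and the width parameter are hypothetical conveniences.

```python
def cutting_score_interval(study_score, sd, width=1.0):
    """Return (lower, upper) limits `width` standard deviations around
    the study cutting score."""
    return (study_score - width * sd, study_score + width * sd)


lower, upper = cutting_score_interval(0.800, 0.036)
print(f"({lower:.2f}, {upper:.2f})")  # prints "(0.76, 0.84)"
```

Setting width to 1 reproduces the one-standard-deviation interval in the
text; other multiples could be used if a wider interval is wanted.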

Given these data, therefore, in a certain statistical sense
ȳ = .80 is the best single number (proportion of items correct) to use as
a cutting score, π_o; however, an investigator is well advised to
entertain some uncertainty about whether or not this value for π_o is
"correct" in some absolute sense. Also, as will be indicated in Section 4,
for some purposes, procedures are available that employ what is called
an "indifference zone" for the cutting score π_o; and the confidence
interval discussed above can be helpful in picking an indifference
zone.

Other  Considerations 

One factor that can contribute greatly to differences among raters
in their ȳ_r values is differing ideas about what constitutes minimal
performance. Any definition of minimal competence is almost always
a matter of judgment (packing a parachute may be an exception!), but
very disparate notions about minimal competence can render a cutting
score study of relatively little value. At the same time, however,
the raters themselves should be well qualified to define what minimal
competence is, or at least to have a voice in any such definition.
In particular, it is very difficult, if not impossible, for raters to
participate in a cutting score study using someone else's definition
of minimal competence. For these reasons, it is advised that raters
have the opportunity to discuss their possibly different notions about
minimal competence prior to conducting the actual study. Hopefully,
they can reach some consensus or at least mitigate their differences of
opinion in a mutually acceptable manner.

Another issue to be considered is the manner in which items are
provided to raters; specifically, are the answers provided along with
the items? All things considered, it is probably best that answers
be supplied. In doing so, one can obtain an additional check on
the correctness of the indicated answers, and raters are probably
more likely to pay careful attention to each item individually. Assuming
that the answers are supplied, each rater should be directed to
indicate any items that he/she judges to be keyed incorrectly. If
it is determined after the raters complete their task that an item is
keyed incorrectly, it (and the probabilities assigned to it) should
be eliminated from the study, and the item should be revised or
discarded. If, on the other hand, it is determined after careful
consideration that a rater said an item was keyed incorrectly, but actually
it was keyed correctly, then that rater's judgment (i.e., assigned
probability) for that item should be eliminated in determining ȳ.
This can happen; no individual rater is infallible, even in
his/her area of expertise.

Table 3.1 illustrates the rather common occurrence of one rater
(in this case Rater 5) providing judgments that are markedly different
from the judgments provided by other raters. Even so (assuming all
raters were chosen carefully in the first place), an atypical rater
should not be eliminated from the study unless there is an obvious
reason (e.g., sickness) for that rater's atypical judgments. If such
a reason exists, then all statistics should be re-calculated based on the
reduced set of raters. [For example, if Rater 5 were eliminated from
the synthetic data, then the reader can verify that ȳ = .835; s(ȳ_r) = .031;
and, using Equation 3.3, σ(ȳ) = .021.]

One modification of (or addition to) this procedure for establishing
a cutting score involves having the raters, as a group, provide
a consensus probability for each item after they have independently
provided their judgments about each item. Then the mean of these
consensus probabilities is used as the cutting score. If this modification
is employed, the resulting data should be examined very carefully to
ensure that no single rater is exerting undue influence over the
judgments of other raters. (Also, if this modification is used one should
keep in mind that forced consensus is not really agreement, although
forced consensus can effectively hide disagreement.)


4.  Establishing  an  Advancement  Score 

When domain-referenced testing is employed to make mastery/non-mastery
types of decisions, it is necessary to consider a cutting score,
π_o; but, in addition, the investigator must specify an observable score,
x_o, such that an examinee who gets x_o or more items correct will be
declared a master, and an examinee who gets fewer than x_o items correct
will be declared a non-master. This score is called an advancement
score, with the symbol x_o referring to the advancement score in terms
of number of items correct and (later) the symbol c_o referring to the
advancement score in terms of proportion of items correct.

In principle, one wants to pass, or advance, an examinee if that
examinee's universe score, π_p, is equal to or greater than the cutting
score, π_o. However, one cannot directly use such a decision rule
because a specific domain-referenced test will consist of only a sample
of items from the universe. Based on any sample of items, an examinee's
observed mean score, x̄_p, can be calculated, but not the examinee's
universe score, π_p. Furthermore, the cutting score, π_o, may not correspond
with a possible observed mean score for a test of n items. (For example,
if n = 10, then no proportion of items correct will correspond with a
cutting score of .85.)
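
The point about attainable scores can be checked with a few lines of
Python: on an n-item test the only possible observed proportions are
0/n, 1/n, ..., n/n. The function names are hypothetical.

```python
def attainable_proportions(n):
    """All proportions of items correct that can be observed on an n-item test."""
    return [x / n for x in range(n + 1)]


def is_attainable(cutting_score, n, tolerance=1e-9):
    """True if cutting_score equals x/n for some whole number x of items correct."""
    return any(abs(cutting_score - p) < tolerance for p in attainable_proportions(n))
```

For n = 10 a cutting score of .85 falls between the attainable values
.80 and .90, whereas .80 is attainable; with n = 20, .85 corresponds to
17 items correct.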

Let us suppose that, as a result of some cutting score study, π_o
is specified to be .80, and let us assume that a test will consist
of n = 10 items. Since .80 x 10 = 8, an investigator might decide that
the advancement score should be:

    x_o = 8 in terms of number of items correct; or

    c_o = x_o/n
        = 8/10
        = .80 in terms of proportion of items correct.

In this example, choosing x_o to be eight items correct may appear
reasonable and, indeed, this particular advancement score may be a good
choice in some particular context. However, the "logic" presented above
for choosing an advancement score is rather superficial. For example,
this logic does not take into account the fact that an observed score
may be, and usually is, different from a universe score. As will become
evident later, a more thorough analysis could lead to choosing
some advancement score other than x_o = 8.
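
To see why the gap between observed and universe scores matters, one can
compute the chance that an examinee with a given universe score reaches
the advancement score. The sketch below assumes a simple binomial model
(items sampled at random from the universe and answered independently);
that model and the function name are assumptions made for illustration,
not a statement of the procedure developed in this section.

```python
from math import comb


def probability_of_advancing(universe_score, n, advancement_score):
    """P(number correct >= advancement_score) on an n-item test for an
    examinee with the given universe score, under a binomial model."""
    return sum(
        comb(n, x) * universe_score ** x * (1 - universe_score) ** (n - x)
        for x in range(advancement_score, n + 1)
    )


# An examinee below a cutting score of .80 who may still be advanced.
p_low = probability_of_advancing(0.70, 10, 8)
# An examinee above the cutting score who may still be held back.
p_high = probability_of_advancing(0.85, 10, 8)
```

Under this assumed model, with n = 10 and x_o = 8, an examinee whose
universe score is .70 would be advanced about 38% of the time, and an
examinee whose universe score is .85 would fail to advance about 18% of
the time.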

o 

The purpose of this section is to provide a reasonably sound,
yet relatively simple, table look-up procedure for choosing an advancement
score. Even though this procedure is quite simple compared
to others that might be used, it does involve consideration of several
technical issues. Specifically, to use this procedure, one must first
specify a test length, a loss ratio, and an indifference zone. These
issues are discussed below, followed by an illustration of how to use
the table look-up procedure.

Related Issues

Sometimes, choosing a test length (n) is a more difficult problem
than it may appear to be at first glance. All other things being equal,
longer tests are to be preferred over shorter tests, because longer
tests reduce certain types of errors (discussed more fully later).
Also, longer tests are more valid in the sense that they provide a more
thorough representation of the intended universe of items. At the same
time, however, in domain-referenced testing environments, factors such
as available testing time frequently make it very difficult and/or costly
to use tests that are very long. For now, it will be assumed that there
already exists some reasonable basis for choosing a particular test
length, at least for the initial form(s) of a domain-referenced test.
In subsequent sections, as different concepts and procedures are developed,
it will be possible to identify some reasonable statistics to
consider in choosing, or modifying, test length.

Classification errors and loss ratio. The concept of a loss ratio
involves a consideration of errors that can be made in classifying
an examinee as a passing examinee (master) or a failing examinee
(non-master). Specifically, there are two classification errors that can
be made:

(a) a false positive error occurs if an examinee is declared a master
(i.e., advanced) who has a universe score below π_o; and

(b) a false negative error occurs if an examinee is declared a
non-master (i.e., not advanced) who has a universe score above π_o.

These two classification errors are considered more fully in Section
5 in the context of decisions about individual examinees. Here, our
concern is with a certain kind of judgment about false positive and
false negative errors. Specifically, in this handbook the term "loss
ratio" refers to a number reflecting judgment about the seriousness
of a false positive error compared to the seriousness of a false
negative error. For example, if false positive errors were judged to be
twice as serious as false negative errors, then the loss ratio would be
two; and, if both types of classification errors were equally serious,
then the loss ratio would be one.
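
One way to picture what a loss ratio does is to weight the two kinds of
error when comparing candidate advancement scores. The sketch below is
only an illustration of that idea: the function name, the weighted-sum
form, and the error probabilities are all hypothetical and are not taken
from the table look-up procedure described in this section.

```python
def weighted_loss(p_false_positive, p_false_negative, loss_ratio):
    """Combine two error probabilities, counting a false positive
    `loss_ratio` times as heavily as a false negative."""
    return loss_ratio * p_false_positive + p_false_negative


# Hypothetical error probabilities for two candidate advancement scores.
lenient = weighted_loss(p_false_positive=0.20, p_false_negative=0.10, loss_ratio=2)
strict = weighted_loss(p_false_positive=0.05, p_false_negative=0.35, loss_ratio=2)
```

With a loss ratio of two, the lenient score has a weighted loss of .50
and the strict score .45, so the stricter score would be preferred. With
a loss ratio of one the weighted losses are .30 and .40, and the
preference reverses.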

By definition, the specification of a loss ratio involves subjective
judgment on the part of a person (or persons) intimately familiar
with the testing context. In making this judgment one needs to
consider the consequences of inappropriately passing or inappropriately
failing an examinee. For example, in many domain-referenced testing
contexts, it is frequently argued that an examinee who is inappropriately
advanced (false positive error) is likely to be unsuccessful
on-the-job or in subsequent instruction; and this type of error is
judged more serious than the time and cost involved in inappropriately
re-cycling an examinee through an instructional sequence (false
negative error). These particular judgments suggest that a loss ratio, in
such contexts, should be defined as some number greater than one: perhaps
two, but probably not three unless instructional time and cost are quite
unimportant.

Indifference zone. An indifference zone is some range of universe
scores within which one is "indifferent" about false positive and false
negative errors. Let us identify the lower limit of this range as π_L,
the upper limit as π_H, and the range itself as (π_L, π_H). Suppose
an investigator is able to specify values for π_L and π_H such that,
for any examinee whose universe score is between π_L and π_H, there is
virtually no loss involved in declaring a true master to be a non-master
or in declaring a true non-master to be a master. In such a case the
interval (π_L, π_H) may be viewed as an indifference zone. This rather
direct approach to defining an indifference zone may or may not make
sense in a particular context.

Another approach involves the procedure for establishing a cutting
score discussed in Section 3. Specifically, consider again σ(y) in
Equation 3.3, which is the standard deviation of y over replicated
studies, if each study involved all the items in the universe. It was
stated in Section 3 that y can serve as π_o, and a 68% confidence
interval for π_o can be viewed as extending from y - σ(y) to y + σ(y),
approximately. This confidence interval (or something close to it) might
be viewed as an indifference zone. Consider, for example, the synthetic
data treated in Section 2. For these data, y = .80; using Equation
3.3, σ(y) = .036; and the 68% confidence interval is (.76 to .84).

Since this interval indicates a degree of uncertainty about some "ideal"
value for a cutting score, it seems reasonable to assume that an
investigator might have little basis for being anything but indifferent
about classification errors for examinees whose universe scores lie in
the interval (.76 to .84).
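
The interval arithmetic in this example can be sketched in a few lines of
Python. The function name and its arguments are illustrative only; the
values .80 and .036 are the ones quoted in the text.

```python
def indifference_zone(cutting_score, sd, k=1.0):
    """Return (lower, upper) = cutting_score -/+ k * sd.

    With k = 1 this is the approximate 68% interval described in the
    text; other multipliers are the caller's own judgment.
    """
    return cutting_score - k * sd, cutting_score + k * sd


lower, upper = indifference_zone(0.80, 0.036)
print(f"({lower:.2f} to {upper:.2f})")  # prints "(0.76 to 0.84)"
```

The multiplier k is exposed only so that a wider or narrower zone can be
tried; the handbook itself uses one standard deviation on either side.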

In  considering  either  of  the  above  approaches  to  establishing  an 
indifference  zone,  it  needs  to  be  recognized  that  these  procedures 
are  not  to  be  viewed  as  statistical  excuses  for  being  indifferent,  in 
the  sense  of  uncaring,  about  individual  examinees  who  have  observed  mean 
scores close to π_o. Rather, these procedures are to be viewed as aids
in the process of establishing an indifference zone, which is a
necessary consideration for picking an advancement score using the table
discussed  below. 

Advancement  Score  Table 

Given a test length, a loss ratio, and an indifference zone, Table
A.1 provides a specific advancement score, x_o, in terms of number of
items correct. (To obtain the advancement score in terms of proportion
of items correct, one simply uses the relationship c_o = x_o/n.) The rows
of Table A.1 are associated with different test lengths, ranging from
6 to 30 items; and the columns are associated with 20 indifference zones,
organized according to the mid-points of the zones, with mid-points
ranging from .65 to .90. For each row and column, there are three
tabled entries (separated by slashes) corresponding to advancement
scores associated with loss ratios of 1, 2, and 3, respectively.

To illustrate use of Table A.1, let us consider the following
judgments about test length, loss ratio, and indifference zone:

(a) Test length. Let us assume that testing time is at a premium,
and the universe of items is rather narrow. Taking these two
considerations into account, it is judged that about n = 10 test items
seems reasonable.

(b) Loss ratio. Let us assume that the domain-referenced testing
context is one in which false positive errors are judged to be somewhat
more serious than false negative errors, and a loss ratio of about two
seems reasonable.

(c) Indifference zone. Let us suppose that it is decided to use
the results of a cutting score study in making judgments about an
indifference zone. Specifically, let us suppose that the results
reported in Section 2 are based on the appropriate universe of items.
This study suggests that an approximate 68% confidence interval for π_o
is (.76 to .84), and it will be assumed that this confidence interval
can serve as an approximate indifference zone.

Now, given the above judgments, to pick an advancement score,
one uses the fifth row (n = 10) and second column (.75 to .85) of the
second page of Table A.1. The tabled entries corresponding to this
row and column are 9/9/9. Since all of these entries are the same
number, it is obvious that the advancement score is x_o = 9 or
c_o = 9/10 = .90. To be specific, since the loss ratio has been defined
as two, the second entry is actually the advancement score for this
illustration.
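
The lookup just described can be sketched as a small Python routine. The
dictionary below is a hypothetical stand-in for Table A.1 and holds only
cells quoted in this handbook's text (the 9/9/9 cell above and the n = 20
value mentioned in Section 5); it is not the full table.

```python
# Hypothetical excerpt of Table A.1, keyed by
# (test length n, indifference zone, loss ratio) -> advancement score x_o.
# Only values quoted in the text are entered.
TABLE_A1_EXCERPT = {
    (10, (0.75, 0.85), 1): 9,
    (10, (0.75, 0.85), 2): 9,
    (10, (0.75, 0.85), 3): 9,
    (20, (0.75, 0.85), 2): 17,
}


def advancement_score(n, zone, loss_ratio):
    """Return (x_o, c_o): items correct and proportion correct needed to advance."""
    x_o = TABLE_A1_EXCERPT[(n, zone, loss_ratio)]
    return x_o, x_o / n


x_o, c_o = advancement_score(10, (0.75, 0.85), 2)
print(x_o, c_o)  # prints "9 0.9"
```

A lookup for a cell that is not entered raises KeyError, which is
deliberate here: the sketch should not guess at table entries it does not
contain.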

In the above example, note that the indifference zone (.75 to .85)
specified in the second column of the second page of Table A.1 is not
exactly equal to the indifference zone of (.76 to .84), which was
initially chosen. Any such slight disparity can be overlooked without
serious consequences, because, for the most part, the procedure used
to develop Table A.1 is insensitive to small disparities in indifference
zones. Furthermore, it is not necessary that π_o be exactly at the
midpoint of the indifference zone. Indeed, for reasons beyond the
scope of this handbook, it is sufficient that π_o be somewhere within
the indifference zone.

Table A.1 indicates (and the above example illustrates) that this
procedure for choosing an advancement score is also relatively
insensitive to small changes in loss ratio. Indeed, for any specific test
length and indifference zone in Table A.1, the suggested advancement
scores differ by at most one correct item.

The  above  points  about  "insensitivity"  have  been  made  to  highlight 
the  fact  that  this  procedure  for  choosing  an  advancement  score  does  not 
necessitate  arguing  about  minute  differences  of  opinion  with  respect  to 
an  appropriate  indifference  zone  or  loss  ratio — a  reasoned  consideration 
of  these  issues  is  sufficient  for  the  procedure. 




5. Errors of Measurement,
Errors of Classification, and
Inferences about an Examinee's Universe Score

Sections 2, 3, and 4 have considered issues that are addressed prior
to making any decision about an examinee. Let us now assume that the
issues discussed in Sections 2, 3, and 4 have been addressed, a
domain-referenced test of n items has been administered to a group of
examinees, and each examinee's score on the test has been determined. In
this section, consideration is given to the precision, or quality, of
certain statements, or decisions, that might be made about an examinee.
To address these issues, the only examinee datum that will be employed
is the examinee's test score. To simplify notation in this section,
usually the examinee's number of items correct will be denoted x, the
examinee's proportion of items correct will be denoted x̄ (rather than
x̄_p), and the examinee's universe score will be denoted π (rather than
π_p).

It cannot be emphasized enough that π is always unknown, and x̄ is
only an estimate of π. Consequently, there is always some degree of
uncertainty about any statement concerning π. For example, if x̄ = .80,
one may say that π is "about" .80, but this statement clearly suggests
that π and x̄ may be different, and perhaps dramatically different.
This difference between x̄ and π is called an error of measurement.

Furthermore, since x̄ is an imperfect estimate of π, mastery/non-mastery
decisions based on x̄ (or x) may be incorrect, and an error of
classification may be made. This issue was introduced in the previous
section in the context of specifying a loss ratio. In this section,
errors of classification are considered in more detail, from the
perspective of decisions about examinees.

It needs to be recognized that, since π is unknown, one cannot
specify whether or not a classification error has been made for an
individual examinee; nor can one specify a particular value for an
individual examinee's error of measurement. However, given n and x
(or x̄), it is possible to make statements about the probability of
correct and incorrect decisions, and about likely values of π.
Procedures for doing so are described and illustrated in this section,
after a more detailed consideration of errors of measurement and
classification.


Errors  of  Measurement  and  Classification. 

Recall that an examinee's universe score is the proportion of items,
π, that the examinee would get correct if the examinee were administered
all items in the universe. Suppose an examinee takes a domain-referenced
test with n = 10 items and gets x = 8 items correct. It should be
intuitively obvious that this does not necessarily mean that the
examinee's universe score is x̄ = x/n = 8/10 = .80. After all, the
examinee was tested with only 10 items; and it is to be expected that
x̄ = .80 is an imperfect estimate of the examinee's universe score. This
imperfection in measurement is called measurement error. Specifically,
measurement error is the difference between an examinee's test score
(expressed as a proportion of items correct, x̄) and the examinee's
universe score:

Δ = x̄ - π.

Note the use of the symbol Δ to designate measurement error. Clearly,
Δ can be either positive or negative, as well as being either large or
small.

It is evident from the definition of Δ that a cutting score, π_o,
plays no role in considerations regarding error of measurement. However,
for mastery/non-mastery decisions a cutting score, π_o, is involved; and
for such decisions, an error of classification may be made in addition
to an error of measurement. As noted in Section 4, there are two types
of errors of classification:

(a) a false positive error (f+) occurs if an examinee is declared a
master (x ≥ x_o) when the examinee's universe score is below π_o; and

(b) a false negative error (f-) occurs if an examinee is declared
a non-master (x < x_o) when the examinee's universe score is at or
above π_o.

These two possible errors of classification are represented in Table 5.1
along with the two possible correct decisions: namely, passing an
examinee who has a universe score at or above π_o (c+), and failing an
examinee who has a universe score below π_o (c-).

To better appreciate errors of measurement and classification, consider
Figure 5.1 in which it is assumed that π_o = .80, n = 10, and c_o = .90.
For 12 pairs of values for x̄ and π, Figure 5.1 represents the resulting
error of measurement and error of classification or correct decision.
As illustrated in Figure 5.1:

(a) a false positive decision implies that a positive error of
measurement (x̄ > π) is involved (see lines G, H, and I in Figure 5.1);

(b) a false negative decision implies that a negative error of
measurement (x̄ < π) is involved (see lines J, K, and L in Figure 5.1);
and

(c) even when a correct (positive or negative) decision is made, an
error of measurement (positive or negative) may be involved (see lines
A-F in Figure 5.1).
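
The relationship between the two kinds of error can be sketched in Python
as follows. The function and the example pairs are hypothetical (the 12
pairs plotted in Figure 5.1 are not reproduced here), and π is supplied
only for illustration, since in practice it is never known.

```python
def classify(x, n, x_o, pi, pi_o):
    """Return (error_of_measurement, outcome) for one examinee.

    x: items correct; n: test length; x_o: advancement score (items);
    pi: universe score (unknown in practice); pi_o: cutting score.
    Outcome labels follow Table 5.1: c+, c-, f+, f-.
    """
    delta = x / n - pi          # error of measurement, x-bar minus pi
    passed = x >= x_o           # examinee declared a master
    true_master = pi >= pi_o    # universe score at or above the cutting score
    if passed and true_master:
        outcome = "c+"
    elif passed:
        outcome = "f+"
    elif true_master:
        outcome = "f-"
    else:
        outcome = "c-"
    return delta, outcome


# Hypothetical (x, pi) pairs with n = 10, x_o = 9, pi_o = .80
for x, pi in [(9, 0.70), (8, 0.85), (10, 0.95), (6, 0.70)]:
    delta, outcome = classify(x, 10, 9, pi, 0.80)
    print(f"x = {x:2d}, pi = {pi:.2f}: delta = {delta:+.2f}, {outcome}")
```

In this sketch every f+ case necessarily has a positive delta and every
f- case a negative one, which is the point made in (a) and (b) above.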

In short, the occurrence of an error of measurement does not
necessarily mean that an error of classification will be made; however,
an error of classification is always associated with an error of
measurement, and frequently a rather large error of measurement. Indeed,
errors of classification arise because errors of measurement are
involved. This is one reason why it is highly advisable to pay attention
to issues surrounding errors of measurement, even if the principal focus
of domain-referenced testing is mastery/non-mastery decisions.

Table 5.1

Correct Mastery/Non-Mastery Decisions and
Errors of Classification

                              Universe Score
Observed Score          π < π_o               π ≥ π_o

x < x_o  (Fail)         Correct Negative      False Negative
                        Decision (c-)         Error (f-)

x ≥ x_o  (Pass)         False Positive        Correct Positive
                        Error (f+)            Decision (c+)

Note. The symbol > means "greater than," the symbol ≥ means
"greater than or equal to," the symbol < means "less than," and
the symbol ≤ means "less than or equal to."

[Figure 5.1. Illustration of Errors of Classification and Errors of
Measurement]

It should be noted also that, if an error of classification is made,
it is not correct to describe the error of classification as being either
large or small; such an error is either made or it is not made, nothing
more. For example, lines G and I in Figure 5.1 both represent false
positive classification errors, and line G does not represent a larger
classification error than line I. Rather, line G represents a larger
error of measurement than line I.

It needs to be recognized that, since an individual examinee's
universe score is unknown, we cannot directly determine the error of
measurement for an individual examinee. For the same reason, it is
impossible to say, for certain, whether or not a classification error has
been made for an individual examinee. However, given n and x (or x̄) it
is possible to make statements about: (a) probabilities associated with
correct and incorrect decisions; and (b) likely values for π. Procedures
for doing so are treated in the next two parts of this section.

Probabilities  of  Correct  and  Incorrect  Decisions 

Since one cannot say, for certain, whether or not a classification
error has been made for an individual examinee, it is reasonable to ask,
"How probable is it that an examinee with a score of x (or x̄) on an
n-item test has been misclassified?" Technically, there are many answers
to this question, depending on the assumptions one is willing to make.
The approach taken here to answering this question involves using Table
A.2, which was developed under very simple assumptions (see Appendix B).
Roughly speaking, these assumptions imply that all we know about an
examinee is the examinee's test score, and the fact that the examinee
took a test consisting of a sample of n items from a large universe of
items.

Table 5.2 provides a step-by-step procedure, with examples, for
determining probabilities associated with correct and incorrect
decisions. This procedure involves nothing more complicated than
identifying an entry in Table A.2 and possibly subtracting it from 100.
Note that, in this handbook, a probability is usually identified and
discussed as a percent ranging from 0 to 100. This convention has been
adopted to avoid confusing a statement about a probability with a
statement about an examinee's universe score (π) or observed mean score
(x̄), both of which range from 0 to 1.

It is suggested that, whenever mastery/non-mastery decisions are to
be made, the investigator examine the probabilities in Table 5.2, at
least the probabilities of incorrect decisions for examinees near the
cutting score. For example, using the procedure in Table 5.2 with
n = 10, π_o = .80, and c_o = .90,

    Prob(f-) =  5% if x = 6,
    Prob(f-) = 16% if x = 7,
    Prob(f-) = 38% if x = 8,
    Prob(f+) = 32% if x = 9, and
    Prob(f+) =  9% if x = 10.

[Table 5.2. Use of Table A.2 to Determine Probabilities of Correct and
Incorrect Classification Decisions]

Given such results, an investigator might decide that, when x = 8 or 9,
the probability of an incorrect decision is unacceptably large. If so,
the investigator might consider retesting examinees with scores of 8 or 9
using a different sample of items.

Suppose, for example, that an examinee got 8 out of 10 items correct,
initially, and 10 out of 10 items correct on a retest. The cutting
score is still π_o = .80; but, over both tests,

x = 8 + 10 = 18 and n = 10 + 10 = 20.

To make a decision about this examinee, the investigator must recognize
that the effective test length for this examinee is now n = 20; and,
consequently, a new value for the advancement score, x_o, must be
determined using the procedure discussed in Section 4. Suppose that x_o
turns out to be 17 (which is the value of x_o when the loss ratio is two
and the indifference zone is .75 to .85). Since x = 18 is greater than
x_o = 17, the examinee should be advanced; and Table A.2 indicates that,
under these circumstances, the probability of a false positive error is
18%.
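
For readers without Table A.2 at hand, the misclassification
probabilities quoted in this part can be approximated with a short Python
sketch. It rests on an assumption that is not confirmed by the material
reproduced here: that the "very simple assumptions" of Appendix B amount
to a uniform prior on the universe score, so that the posterior is a
beta distribution. The function names are illustrative only.

```python
from math import comb


def posterior_cdf(pi, x, n):
    """P(universe score <= pi | x of n items correct), uniform prior assumed.

    The posterior is Beta(x + 1, n - x + 1). For integer parameters its
    CDF equals a binomial upper-tail sum, so only math.comb is needed.
    """
    m = n + 1
    return sum(comb(m, k) * pi**k * (1 - pi) ** (m - k)
               for k in range(x + 1, m + 1))


def prob_misclassified(x, n, x_o, pi_o):
    """Return (error label, probability in percent) for an observed score x."""
    below = posterior_cdf(pi_o, x, n)
    if x >= x_o:                    # advanced: wrong if pi is below pi_o
        return "f+", 100 * below
    return "f-", 100 * (1 - below)  # not advanced: wrong if pi >= pi_o


for x in range(6, 11):
    label, pct = prob_misclassified(x, n=10, x_o=9, pi_o=0.80)
    print(f"x = {x:2d}: Prob({label}) = {pct:.0f}%")

label, pct = prob_misclassified(18, n=20, x_o=17, pi_o=0.80)
print(f"retest, x = 18 of 20: Prob({label}) = {pct:.0f}%")
```

Under the stated assumption the sketch gives 5, 16, 38, 32, and 9 percent
for x = 6 through 10, and 18 percent for the pooled retest, in line with
the values quoted above. Table A.2 remains the authoritative source.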

The probabilities of correct and incorrect decisions resulting from the
procedure outlined in Table 5.2 do not depend on having examinee scores
on a specific test; rather, these probabilities are for any test
consisting of a sample of 10 items from a very large universe. It follows
that an investigator might consider making a decision about test length
based on an examination of probabilities of incorrect decisions, for
tests of different length. In Section 6 a closely related issue is
treated in detail.

Intervals for an Examinee's Universe Score

Even though it is impossible to specify a numerical value for error
of measurement for an individual examinee, it is possible to make
statements about probable values of π, given n and x (or x̄). More
specifically, it is possible to determine:

(a) the probability that π is between two particular values
(π_1 and π_2) specified by the investigator; and

(b) an interval (or range of values) for π such that the
investigator can say with P% certainty that the examinee's universe
score is within the interval.

Procedures for determining the probability referenced in (a), above,
and the interval referenced in (b), above, are provided in Table 5.3.
To be technically correct, we should not speak about the probability or
the interval because there are many such probabilities and intervals,
depending on the assumptions one is willing to make. Since the
procedures outlined in Table 5.3 involve a simple application of Table
A.2, the assumptions for this procedure are those involved in generating
Table A.2 (see previous discussion of Table A.2 and Appendix B).

It should be noted that (a) and (b), above, answer different questions.
Specifically, (a) answers the question:

"Given n, x, and two investigator-specified values (π_1 and π_2),
what is the probability that π is between π_1 and π_2?"

For example, using the procedure in Table 5.3, when n = 10, x = 8,
π_1 = .75, and π_2 = .85, there is a 32% probability that π is between
.75 and .85.


By contrast, (b) answers the question:

"Given n, x, and some desired degree of certainty (P%), what
is a range of values which probably includes π?"

For example, given n = 10 and x = 8, Table A.2 reports that:

(1) with 67% certainty π is between .67 and .90;

(2) with 80% certainty π is between .62 and .92; and

(3) with 90% certainty π is between .56 and .94.

Note that if one wants to have a greater degree of certainty about the
range within which an examinee's universe score probably lies, then one
must tolerate a wider interval. For example, the interval (.56, .94) for
90% certainty is quite a bit wider than the interval (.67, .90) for 67%
certainty.
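
Questions (a) and (b) can both be explored with the following Python
sketch. As before, it assumes (without confirmation from Appendix B,
which is not reproduced here) a uniform prior on the universe score, so
that the posterior density is proportional to π^x (1 - π)^(n - x). The
function name and the numerical integration scheme are illustrative
choices, not the handbook's procedure.

```python
def prob_between(lo, hi, x, n, steps=20000):
    """P(lo <= universe score <= hi | x of n items correct).

    Assumes a uniform prior, so the posterior density is proportional to
    pi**x * (1 - pi)**(n - x); areas are computed with the midpoint rule.
    """
    def area(a, b):
        h = (b - a) / steps
        total = 0.0
        for i in range(steps):
            p = a + (i + 0.5) * h
            total += p**x * (1 - p) ** (n - x)
        return total * h

    return area(lo, hi) / area(0.0, 1.0)


# Question (a): limits .75 and .85 with n = 10, x = 8.
print(f"{100 * prob_between(0.75, 0.85, 8, 10):.0f}%")

# Question (b): probability mass inside each interval quoted from Table A.2.
for lo, hi in [(0.67, 0.90), (0.62, 0.92), (0.56, 0.94)]:
    print(f"({lo:.2f}, {hi:.2f}): {100 * prob_between(lo, hi, 8, 10):.0f}%")
```

Under the stated assumption the four printed values come out at about 32,
67, 80, and 90 percent, which agrees with the figures quoted in the text.
The sketch checks the mass inside a given interval; it does not search
for the interval itself.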

Also, given x̄ and some desired degree of certainty, the width of an
interval decreases as n increases. For example, given n = 20 and x = 16,
x̄ = .80 and from Table A.2 a 67% interval is (.71, .87). This interval
is shorter than the corresponding interval (.67, .90) for n = 10 and
x = 8.
In  this  sense  one  can  say  that  long  tests  are  better  than  short  tests,  or, 
more  specifically,  longer  tests  are  generally  associated  with  a  smaller 
average  error  of  measurement  for  examinees.  This  issue  of  test  length 
and  its  relationship  with  errors  of  measurement  is  treated  in  detail  in 
Section  6. 

The intervals reported in Table A.2 are sometimes described as
credibility intervals. Specifically, Table A.2 reports 67, 80, and 90
percent credibility intervals associated with observed mean scores of
x̄ ≥ .50, for test lengths ranging from 5 to 30 items. Similar results
can be obtained for other intervals, other test lengths, and/or other
observed mean scores using the procedure outlined in Table 5.4.
Actually, an interval obtained using the procedure in Table 5.4 is
called a confidence interval rather than credibility interval, and the
interpretation of a confidence interval is slightly different from the
interpretation of a credibility interval. However, for most practical
purposes they can be interpreted in about the same way.

As indicated by the example in Table 5.4, one can say with about
68 percent confidence that an examinee with an observed mean score of
.75 on a 20-item test probably has a universe score between .65 and .85.

By comparison, consider the "corresponding" 67% credibility interval
provided in Table A.2. This credibility interval extends from .65 to
.83. Clearly, the two intervals are quite close, but not exactly the
same. In general, it is recommended that the credibility intervals in
Table A.2 be used whenever possible, and that the procedure in Table 5.4
be used when Table A.2 does not apply. For example, Table A.2 does not
provide 95 percent intervals, but the procedure in Table 5.4 can be used
to obtain such intervals. (Note, however, that the procedure in Table
5.4 does not apply if x̄_p = 0 or 1; and this procedure involves a
normality assumption that becomes less tenable as x̄_p approaches either
0 or 1.)
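
A normal-approximation interval of the kind described here can be
sketched in Python. Table 5.4 is not reproduced in this section, so the
standard error used below, the square root of x̄(1 - x̄)/(n - 1), is an
assumption about that procedure rather than a quotation of it; the
function name and the multiplier argument z are likewise illustrative.

```python
from math import sqrt


def normal_interval(x_bar, n, z=1.0):
    """Approximate confidence interval for a universe score.

    Assumed form: x_bar -/+ z * sqrt(x_bar * (1 - x_bar) / (n - 1)).
    z = 1.0 gives roughly 68% confidence and z = 1.96 roughly 95%.
    """
    if x_bar <= 0.0 or x_bar >= 1.0:
        raise ValueError("procedure does not apply when x_bar is 0 or 1")
    se = sqrt(x_bar * (1 - x_bar) / (n - 1))
    return x_bar - z * se, x_bar + z * se


lo, hi = normal_interval(0.75, 20)
print(f"({lo:.2f}, {hi:.2f})")  # prints "(0.65, 0.85)"
```

With x̄ = .75 and n = 20 the sketch reproduces the interval (.65, .85)
cited from Table 5.4. Nothing in the sketch prevents the limits from
falling outside 0 to 1 for extreme scores, which is one symptom of the
normality assumption becoming less tenable.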

In this author's opinion, in domain-referenced testing, it is usually
advisable to determine credibility or confidence intervals for examinee
universe scores, at least for those examinees about whom important
decisions are to be made. If nothing else, such intervals are usually
very revealing indicators of the amount of measurement error possibly
involved in using x̄ as if it were π. If an investigator feels that a
specific interval is too broad for a specific decision, then the
investigator might consider retesting the examinee.

Suppose, for example, that an examinee got 8 out of 10 items correct,
initially, with a 67% credibility interval for π extending from .67 to
.90. If the examinee were retested and got 10 out of 10 items correct,
then for the combined tests n = 20, x = 18, and a 67% credibility
interval extends from .82 to .95. This latter interval is considerably
narrower than the former one; and, of course, the additional information
supplied by the retest suggests that the examinee's universe score is
probably higher than originally expected.


6. Group-Based Coefficients of Agreement and
Measures of Error

Section 5 considered errors of measurement and errors of
classification based on an individual examinee's score on a test. This
section considers issues involving group performance on a test.
Specifically, the principal statistics to be discussed are indicated in
Table 6.1.

The statistics 1 - p_o and σ²(Δ) in Table 6.1 are closely related
to errors of classification and errors of measurement, respectively.
Specifically, 1 - p_o can be interpreted as the probability of an
inconsistent decision; and σ²(Δ) can be interpreted as the average value
of the squared errors of measurement for examinees. As such, these
statistics provide information about errors for a group of examinees, as
opposed to an individual examinee.

The other statistics in Table 6.1 are called agreement coefficients
in this handbook. Each of them has a value somewhere between 0 and 1,
with higher values indicating greater degrees of agreement than lower
values. The notion of "agreement" reflected by these coefficients
involves considering what would happen (hypothetically) if examinees
were administered many domain-referenced tests, with each test
consisting of a different sample of n items from the universe. For a
given test length (n), a high value for an agreement coefficient
suggests that there would be a high degree of consistency in certain
scores on these different tests. For example, if we knew that most
persons classified as masters on one test would be classified as masters
on most other tests, too, then one type of agreement would be relatively
high. Although the above conceptual explanation of agreement
coefficients rests on considering multiple tests, in practice these
coefficients can be estimated using a single test, only; and in this
handbook such single-test estimates are the only ones given detailed
consideration.

Table 6.1

Loss Functions, Agreement Coefficients, and Errors
Based on Group Performance on a Test

                     Agreement Coefficients
Type of           Not Corrected    Corrected
Loss              For Chance       For Chance     Errors

Threshold         p_o              Kappa          1 - p_o

Squared Error     Φ(c_o)           Φ              σ²(Δ)

The statistics in Table 6.1 can be classified into two categories
based on the type of loss function involved in defining them. These
two loss functions are called "threshold" loss and "squared error" loss.
The subject of loss functions, per se, is a highly technical
consideration that will not be treated in great detail here. For present
purposes, it is sufficient to know that (a) a threshold loss function
involves consideration of errors of classification, assumes that all
false positive errors are equally serious, and assumes that all false
negative errors are equally serious; and (b) a squared error loss
function in domain-referenced testing involves consideration of errors
of measurement and assumes that the seriousness of an error depends on
(among other things) the squared distance between an examinee's observed
and universe scores. Later, more will be said about these two loss
functions; for now the reader should simply recognize that these two
loss functions involve different approaches to addressing similar types
of issues.

To develop some further understanding of the statistics in Table 6.1,
suppose that test scores were available for a group of examinees on two
forms of a domain-referenced test. Under this circumstance, the
threshold loss coefficient denoted p_o in Table 6.1 would be

p_o = (proportion of examinees classified as masters on both forms)
      + (proportion of examinees classified as non-masters on both forms).

The coefficient p_o is, in effect, the proportion of examinees
consistently classified into the same category (mastery or non-mastery)
on the two tests.

It follows from the above paragraph that 1 - p_o is the proportion
of examinees who are inconsistently classified on the two tests (i.e.,
classified as a master on one form and a non-master on the other). This
proportion of inconsistent classifications is a group-based measure of
error in a threshold loss sense, when scores on two tests are available.

The threshold loss coefficient $p_o$ is not corrected for the expected
"chance" agreement if all examinees were randomly assigned to a mastery
or non-mastery status on each of the forms. The threshold loss coefficient
corrected for such chance agreement is called Kappa, which is defined as:

$$\text{Kappa} = \frac{p_o - p_c}{1 - p_c},$$

where $p_c$ is chance agreement. In a sense, Kappa is a "pure" measure of
agreement attributable to the testing procedure, under threshold loss
assumptions.
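The two-form analogy can be sketched in a few lines of code. This is only an illustration: the function name, the example classifications, and the way chance agreement is computed (the agreement expected from independent classifications with the observed proportions of masters) are assumptions of the sketch, not procedures given in Table 6.1.

```python
def agreement_from_two_forms(form1, form2):
    """Return (p_o, p_c, kappa) for paired mastery classifications.

    form1 and form2 hold one boolean per examinee (True = classified as a
    master), listed in the same examinee order for both forms.
    """
    if len(form1) != len(form2) or len(form1) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    k = len(form1)
    both_masters = sum(1 for a, b in zip(form1, form2) if a and b) / k
    both_non_masters = sum(1 for a, b in zip(form1, form2) if not a and not b) / k
    p_o = both_masters + both_non_masters

    # Chance agreement (an assumption of this sketch): the agreement expected
    # if the two classifications were independent, given the observed
    # proportion of masters on each form.
    m1 = sum(1 for a in form1 if a) / k
    m2 = sum(1 for b in form2 if b) / k
    p_c = m1 * m2 + (1 - m1) * (1 - m2)
    if p_c == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    kappa = (p_o - p_c) / (1 - p_c)
    return p_o, p_c, kappa


# Hypothetical classifications for 10 examinees (not data from the handbook).
form_a = [True, True, True, True, True, True, False, False, False, False]
form_b = [True, True, True, True, True, False, True, False, False, False]
p_o, p_c, kappa = agreement_from_two_forms(form_a, form_b)
print(round(p_o, 2), round(p_c, 2), round(kappa, 2))
```

In this hypothetical example eight of the ten examinees are classified consistently, so $p_o$ is well above Kappa, which discounts the agreement that independent classifications would produce anyway.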

The  reader  needs  to  be  cautioned  not  to  take  the  above  "two-test" 
analogy  too  literally.  It  is  offered  simply  as  an  aid  in  thinking  about 
these  statistics.  Again,  in  this  section  the  procedures  treated  involve 
a  single  administration  of  a  single  form  of  a  domain-referenced  test. 

As noted in Table 6.1, corresponding to each of these three threshold
loss statistics there is a statistic for squared error loss. For example,
$\sigma^2(\Delta)$ is the average squared error of measurement for the
population of examinees, and the two agreement coefficients for squared
error loss involve $\sigma^2(\Delta)$. These squared error loss statistics
provide a different perspective on agreement (and disagreement).


Throughout this section all reference to a cutting score, $\pi_o$, is
replaced by consideration of $c_o = x_o/n$, the advancement score in terms
of proportion of items correct. That is, in considering both squared error
loss and threshold loss, $c_o$ is sometimes used when it might be argued
that $\pi_o$ should be involved. To do so, however, would necessitate
considerable complexities, no matter what loss function is involved.

Finally, it should be noted that some persons refer to the agreement
coefficients discussed in this section as "reliability" coefficients. The
word "reliability" is not used here principally to avoid unwarranted
associations between the coefficients in Table 6.1 and classical reliability
coefficients for norm-referenced tests. Given this caveat, however, much
of this section treats issues traditionally associated with measurement
consistency, or "reliability" considerations. (Also, in a sense mentioned
later, these issues have validity connotations for domain-referenced
interpretations.)

Squared  Error  Loss 

Squared error loss statistics are conceptually more involved than
their threshold loss counterparts. Here, however, initial consideration is
given to squared error loss statistics because there are certain
computational conveniences in proceeding in this order.

Suppose that an n = 10 item test were administered to k = 25 examinees;
and suppose that after the items were scored, the resulting data
matrix was that given in Table 6.2. An entry in this data matrix is denoted
$x_{pi}$, the score (0 = incorrect, 1 = correct) for examinee p on item i.

Table 6.2

Group Performance on a Test:
A Synthetic Data Set with Sample Statistics

[The person-by-item score matrix (25 persons by 10 items, with each
person's mean score $\bar{x}_p$) is not legible in the source. Only the
item difficulty levels and summary statistics below are recoverable.]

Item           1    2    3    4    5    6    7    8    9    10
$\bar{x}_i$   .88  .76  .96  .88  .84  .88  .80  .80  .68  .76

$s^2(\bar{x}_i)$ = .0058      $s(\bar{x}_i)$ = .076
$\bar{x}$ = .824
$s^2(\bar{x}_p)$ = .0282      $s(\bar{x}_p)$ = .168

Other statistics reported in Table 6.2 are as follows:

(a) $\bar{x}_p$ is the proportion of items that examinee p got correct;

(b) $s^2(\bar{x}_p)$ and $s(\bar{x}_p)$ are the variance and standard
deviation, respectively, of the scores $\bar{x}_p$;

(c) $\bar{x}_i$ is the proportion of persons who got item i correct, i.e.,
the item difficulty level discussed in Section 2;

(d) $s^2(\bar{x}_i)$ and $s(\bar{x}_i)$ are the variance and standard
deviation, respectively, of the item difficulty levels; and

(e) $\bar{x}$ is the mean proportion of items correct for persons, or,
equivalently, the mean difficulty level for items.
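The sample statistics listed above can be computed directly from a 0/1 score matrix. The following is a minimal sketch; the function names and the small score matrix are hypothetical. The variance uses a denominator equal to the number of values, which is the convention that reproduces the reported $s^2(\bar{x}_i)$ = .0058 from the item difficulty levels in Table 6.2.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Variance with a denominator equal to the number of values."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values) / len(values)


def sample_statistics(scores):
    """Summarize a 0/1 matrix: one row per examinee, one column per item."""
    k = len(scores)       # number of examinees
    n = len(scores[0])    # number of items
    person_means = [sum(row) / n for row in scores]
    item_means = [sum(row[i] for row in scores) / k for i in range(n)]
    return {
        "person_means": person_means,
        "item_means": item_means,
        "grand_mean": mean(person_means),
        "var_person_means": variance(person_means),
        "var_item_means": variance(item_means),
    }


# Hypothetical 4-examinee, 3-item matrix (not the Table 6.2 data).
toy_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1]]
toy = sample_statistics(toy_scores)

# Item difficulty levels as reported in Table 6.2.
reported_difficulties = [.88, .76, .96, .88, .84, .88, .80, .80, .68, .76]
reported_mean = mean(reported_difficulties)
reported_var = variance(reported_difficulties)
print(round(reported_mean, 3), round(reported_var, 4))
```

Applied to the reported item difficulty levels, the same helper functions give the mean and variance of item difficulty shown in Table 6.2.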

Using these sample statistics, Table 6.3 provides formulas, with
illustrative computations, for estimating agreement coefficients and
other quantities of interest involving squared error loss. (These
formulas are used here because they are as computationally simple to
use as any that can be derived; however, other more computationally
difficult formulas would be better in terms of revealing certain
underlying theoretical issues.)

Universe score variance. It has been emphasized repeatedly in
previous sections that an examinee's observed score, $\bar{x}_p$, is not
necessarily equal to his/her universe score, $\pi_p$. It follows that the
variance of examinees' observed scores, $s^2(\bar{x}_p)$, is not necessarily
equal to the variance of examinees' universe scores, $\sigma^2(\pi_p)$, which
is abbreviated $\sigma^2(\pi)$ in Table 6.3. Actually, $\sigma^2(\pi)$ is
almost always less than the observed score variance. This fact is not
immediately evident from Equation 6.1 in Table 6.3; but the computation
section of Table 6.3 shows that $\sigma^2(\pi)$ = .0165, a value considerably
smaller than $s^2(\bar{x}_p)$ = .0282. Note that the square root of
$\sigma^2(\pi)$ is simply the standard deviation of examinee universe scores,
which is $\sigma(\pi)$ = .129 for the synthetic data.

Error variance. Recall from Section 5 that error of measurement
is defined as the difference between an examinee's observed and universe
scores:

$$\Delta_p = \bar{x}_p - \pi_p .$$

If we were to square these differences for all examinees, and then get
the average of these squared differences, we would obtain $\sigma^2(\Delta)$.
Of course, $\pi_p$ is never known exactly, so neither is $\Delta_p$; and,
consequently, $\sigma^2(\Delta)$ cannot be obtained directly by averaging the
squared values of $\Delta_p$. However, one can estimate $\sigma^2(\Delta)$
using Equation 6.2 in Table 6.3, and the square root of this value is an
estimate of the standard deviation of examinee errors of measurement. For
the data in Table 6.2, Table 6.3 shows that $\sigma^2(\Delta)$ = .0130 and
$\sigma(\Delta)$ = .114. It is not immediately evident from Table 6.3, but
$\sigma^2(\Delta)$ depends upon the variance of item difficulty levels, among
other things. In general, the smaller the variance of item difficulty
levels, the smaller the value of $\sigma^2(\Delta)$.


Agreement coefficient not corrected for chance. The above discussion
of universe score variance and error variance makes no reference
to mastery/non-mastery decisions. When such decisions are to
be made, the advancement score plays a role in the definition of an
agreement coefficient not corrected for chance, although error variance
is still $\sigma^2(\Delta)$. This agreement coefficient is defined as:

$$\Phi(c_o) = \frac{\sigma^2(\pi) + (\mu - c_o)^2}{\sigma^2(\pi) + (\mu - c_o)^2 + \sigma^2(\Delta)} ,$$

where $c_o = x_o/n$ is the advancement score in terms of proportion of items
correct; and $\mu$ is the mean score over the universe of items and the
population of persons. As such, $\mu$ has similarities with $\bar{x}$, but
is not identical to it. The above definition is rather difficult to use
directly to estimate $\Phi(c_o)$, so a simpler formula is provided by
Equation 6.3 in Table 6.3.

Note that Equation 6.3 depends upon $(\bar{x} - c_o)^2$, the squared
difference between $\bar{x}$ and the advancement score. For the synthetic
data with $\bar{x}$ = .824 and $c_o$ = 9/10 = .9, Table 6.3 shows that
$\Phi(.9)$ = .62. One might ask, however, what would be the value of
$\Phi(c_o)$ if $\bar{x}$ actually equaled $c_o$ in Equation 6.3? The answer
is provided by Equation 6.4, which is also identified as KR-21. As discussed
later, KR-21 also plays an important role in estimating threshold loss
agreement coefficients. For the synthetic data KR-21 = .54, and this is the
smallest value that Equation 6.3 can have for these data, no matter what
the advancement score actually is.

Agreement coefficient corrected for chance. The agreement coefficient
corrected for chance, which is denoted $\Phi$, is easily obtained
using the values of $\sigma^2(\pi)$ and $\sigma^2(\Delta)$ in Equation 6.5 in
Table 6.3. For the synthetic data, $\Phi$ = .56, a value very close to
KR-21 = .54. Indeed, $\Phi$ and KR-21 almost always have very similar values.
This occurs principally because neither one of them depends on chance
agreement, which is technically $(\mu - c_o)^2$ for squared error loss.
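The three squared error loss coefficients discussed above can be checked numerically. Since Table 6.3 is not reproduced here, the expressions below are assumptions rather than quotations of Equations 6.3 to 6.5: the estimate of $\Phi(c_o)$ is written in a form that reduces to KR-21 when $\bar{x} = c_o$, and $\Phi$ is taken as the definition of $\Phi(c_o)$ with the term $(\mu - c_o)^2$ removed. Both choices reproduce the values reported for the synthetic data. All function names are hypothetical.

```python
def phi_with_cut(n, grand_mean, var_person_means, c_o):
    """Estimate of the agreement coefficient not corrected for chance.

    var_person_means is the variance of examinee proportion-correct scores
    (denominator equal to the number of examinees).
    """
    numerator = grand_mean * (1 - grand_mean) - var_person_means
    denominator = (n - 1) * ((grand_mean - c_o) ** 2 + var_person_means)
    return 1 - numerator / denominator


def kr21(n, grand_mean, var_person_means):
    """KR-21: the value of phi_with_cut when the mean equals the cut."""
    return phi_with_cut(n, grand_mean, var_person_means, c_o=grand_mean)


def phi(universe_score_variance, error_variance):
    """Agreement coefficient corrected for chance."""
    return universe_score_variance / (universe_score_variance + error_variance)


# Values reported for the synthetic data in Tables 6.2 and 6.3.
phi_at_cut = phi_with_cut(n=10, grand_mean=0.824, var_person_means=0.0282, c_o=0.9)
kr21_value = kr21(n=10, grand_mean=0.824, var_person_means=0.0282)
phi_value = phi(universe_score_variance=0.0165, error_variance=0.0130)
print(round(phi_at_cut, 2), round(kr21_value, 2), round(phi_value, 2))
```

Evaluating `phi_with_cut` over a range of advancement scores shows the behavior described in the text: the coefficient is smallest when the advancement score equals the mean, and it grows as the two move apart.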

Interpreting agreement coefficients. Agreement coefficients (and
their reliability counterparts) are discussed and used extensively
in educational measurement, perhaps too extensively! However,
they are frequently difficult to interpret correctly, no matter what
loss function is involved. For this reason, whatever loss function is
involved, the following characteristics of such coefficients should
be kept in mind:

(a) an agreement coefficient generally ranges from 0 to 1, but
a value of, say, .80 is not necessarily "twice as good" as a value of
.40;

(b) when most examinees have observed scores close to the advancement
score, an agreement coefficient not corrected for chance will be
smaller than when most examinees have observed scores relatively far
from the advancement score;

(c) an agreement coefficient will tend to be small whenever universe
score variance is small or error variance is large (even if the
coefficient is based on threshold loss);

(d) an agreement coefficient not corrected for chance reflects
the quality (or consistency) of decisions made about examinees, whereas
an agreement coefficient corrected for chance reflects the contribution
of the test to the quality of such decisions. This is another perspective
on the fact that a coefficient corrected for chance is smaller than its
not-corrected-for-chance counterpart.

Threshold  Loss 

In  the  introduction  to  this  section  it  was  stated  that  a  threshold 
loss  function  assumes  that  all  false  negative  errors  are  equally  serious, 
and  all  false  positive  errors  are  equally  serious. 

To clarify this point let us suppose that the test length is n = 10,
and $c_o = \pi_o = .90$. Obviously, an examinee will not be advanced if he/she
gets 0, 1, 2, ..., 8 items correct. Now, it is almost certain that
some of these examinees will be falsely classified as non-masters,
because it is likely that some of these examinees have universe scores
at or above .90. (Of course, one never knows which examinees are falsely
declared to be non-masters.) For threshold loss it is assumed that any
such false negative error is as serious as any other such error, no
matter what the examinee's universe score actually is; e.g., failing
an examinee with a universe score of $\pi$ = .91 is as serious an error
as failing an examinee with a universe score of $\pi$ = 1.00.

Also, the threshold loss function involves assuming that all false
positive errors are equally serious. For the above example, this means
that passing an examinee with a universe score of, say, $\pi$ = .40 is as
serious an error as passing an examinee with a universe score of, say,
$\pi$ = .70.

It should be noted, however, that the threshold loss function
does not involve assuming that false positive errors are as serious as
false negative errors. That issue is a question of loss ratio, a subject
treated in Section 4.
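The threshold loss assumptions can be made concrete with a small function. This is a sketch only: the function name and the numeric loss values are hypothetical, and in practice universe scores are never known, so the function illustrates the definition rather than a computation one could carry out on real examinees.

```python
def threshold_loss(universe_score, advanced, cutting_score,
                   false_negative_loss=1.0, false_positive_loss=1.0):
    """Loss for one decision under a threshold loss function.

    An examinee is a true master when the universe score is at or above
    the cutting score. The loss depends only on the type of error, never
    on how far the universe score lies from the cutting score.
    """
    is_master = universe_score >= cutting_score
    if advanced and not is_master:
        return false_positive_loss   # a non-master was passed
    if not advanced and is_master:
        return false_negative_loss   # a master was failed
    return 0.0                       # correct decision


# Failing examinees with universe scores .91 and 1.00 costs the same.
fn_near = threshold_loss(0.91, advanced=False, cutting_score=0.90)
fn_far = threshold_loss(1.00, advanced=False, cutting_score=0.90)

# Passing examinees with universe scores .40 and .70 costs the same, but
# that cost need not equal the false negative cost (here a loss ratio of 2).
fp_low = threshold_loss(0.40, advanced=True, cutting_score=0.90,
                        false_positive_loss=2.0)
fp_high = threshold_loss(0.70, advanced=True, cutting_score=0.90,
                         false_positive_loss=2.0)
print(fn_near, fn_far, fp_low, fp_high)
```

The two loss arguments are kept separate to reflect the point made above: equal seriousness is assumed within each type of error, not across the two types.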

Table 6.4 describes and illustrates the steps required to obtain
the threshold loss coefficients $p_o$ (not corrected for chance) and Kappa
(corrected for chance).

Step 1 simply involves recording results already obtained in Tables
6.2 and 6.3 for the synthetic data.

Step 2 involves computing a z-score based on the advancement score,
$c_o$. For these data z = .45, which means that the advancement score is
45/100ths of a standard deviation [$s(\bar{x}_p)$ = .168] above the mean,
$\bar{x}$ = .824.

Step 3 involves determining what proportion of examinees would
have z-scores below z = .45 if examinee scores were normally distributed.
To obtain this result, Table A.3 in Appendix A is required. For
the synthetic data, this proportion is $p_z$ = .67.

Step 4 involves determining the proportion of examinees who would
have z-scores below z = .45 on each of two (hypothetical) n-item tests,
if examinee scores were normally distributed on both tests. For the
synthetic data $p_{zz}$ = .53. This step makes use of KR-21; and $p_{zz}$
will always be less than $p_z$, unless KR-21 actually equals one (a highly
unlikely occurrence).

Step 5 provides formulas for estimating $p_o$ and Kappa using $p_z$ and
$p_{zz}$. For the synthetic data $p_o$ = .72, and Kappa = .36. Again, Kappa
is smaller than $p_o$ because $p_o$ reflects the proportion of examinees
consistently classified, while Kappa reflects the proportion of examinees
consistently classified over and beyond the proportion that would probably
be classified consistently by chance. [The proportion probably
classified consistently by chance is $1 - 2p_z(1 - p_z)$, which is .56
for the synthetic data.]

Finally, Step 6 in Table 6.4 provides an estimate of the proportion
of examinees who are inconsistently classified, i.e., the proportion
of errors involved in the decision-making process, in the sense of
threshold loss errors. For the synthetic data, this proportion is
.28.
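Steps 2 through 6 can be sketched as follows. Table 6.4 is not reproduced in this section, so the formulas for $p_o$ and Kappa below are assumptions that follow from the definitions in the text and agree with the reported values to within rounding. Table A.3 is replaced by a numerical integration of the bivariate normal distribution; the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def both_below(z, rho, intervals=2000, lower=-8.0):
    """P(X <= z and Y <= z) for standard normal X, Y with correlation rho.

    Integrates P(Y <= z | X = x) against the density of X with Simpson's
    rule; intervals must be even and rho strictly between -1 and 1.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        return normal_pdf(x) * normal_cdf((z - rho * x) / scale)

    h = (z - lower) / intervals
    total = integrand(lower) + integrand(z)
    for j in range(1, intervals):
        total += (4 if j % 2 else 2) * integrand(lower + j * h)
    return total * h / 3.0


def threshold_loss_coefficients(grand_mean, sd_person_means, c_o, kr21):
    z = (c_o - grand_mean) / sd_person_means      # Step 2
    p_z = normal_cdf(z)                           # Step 3
    p_zz = both_below(z, kr21)                    # Step 4
    p_o = 1.0 - 2.0 * (p_z - p_zz)                # Step 5 (assumed form)
    p_c = 1.0 - 2.0 * p_z * (1.0 - p_z)           # chance agreement
    kappa = (p_o - p_c) / (1.0 - p_c)
    return {"z": z, "p_z": p_z, "p_zz": p_zz, "p_o": p_o, "p_c": p_c,
            "kappa": kappa, "inconsistent": 1.0 - p_o}  # Step 6


# Values recorded in Step 1 for the synthetic data.
result = threshold_loss_coefficients(0.824, 0.168, 0.9, 0.54)
print({name: round(value, 2) for name, value in result.items()})
```

Because the sketch carries unrounded intermediate values, its results can differ in the second decimal place from hand computations based on the rounded table entries $p_z$ = .67 and $p_{zz}$ = .53.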

The procedure for estimating $p_o$ and Kappa in Table 6.4 is based on
the assumption that examinee universe scores are normally distributed.
In many domain-referenced testing contexts this assumption is probably
not true; but in most cases it is unlikely that violations of this
assumption will cause $p_o$ and Kappa to be poorly estimated.

It is important to note that the statistics discussed above refer
to a group of examinees, not to individual examinees. None of these
statistics specify which examinees are consistently or inconsistently
classified.

Also, for a different group of examinees, and/or a different sample
of items, the results would almost certainly differ. A similar statement
applies to the statistics for squared error loss in Table 6.3.
Such differences do not invalidate the statistics discussed above;
rather, such differences result because what we are really doing is
estimating quantities (called parameters) that we cannot observe
directly.

Test  Length 

Recall that a domain-referenced test is viewed as a sample of
items from a larger universe of items constructed to measure the content
under consideration. Also recall that the examinee scores one
would ideally like to know are the examinee universe scores, i.e.,
examinee scores on the universe of items. These ideal scores can
never be obtained; but, in general, longer tests involve less error and
provide better estimates of examinee universe scores.

Therefore, one obvious question is, "How long should a test be?"
There can be no universal statistical answer to this question, because
any specific attempt to answer it eventually involves answering at
least one other question, namely, "How much error is one willing to
tolerate?" Clearly, the answer to this latter question necessitates
subjective judgment by a responsible person who is well aware of all
aspects of the testing environment and the decisions to be made. Even
so, statistics can help in making informed subjective judgments about
test length.

In particular, two such statistics can be helpful: (a) $\sigma(\Delta)$,
the standard deviation of errors of measurement; and (b) $1 - p_o$, the
proportion of examinees inconsistently classified. Table 6.5 shows how these
two statistics can be estimated for a hypothetical test of length n'.
Actually, only Equation 6.6 in Table 6.5 is required to estimate error
variance and its standard deviation; the other equations and steps are
required to obtain the proportion of examinees inconsistently classified.

Note that in Table 6.5 statistics for a test of length n' are
identified with a prime to distinguish them from the corresponding
statistics for the available n-item test. This distinction is dropped
in Table 6.6, which summarizes results for test lengths of n = 10, 15,
and 20. (The first row of Table 6.6 simply duplicates results already
reported in Tables 6.3 and 6.4 for the 10-item test.) From Table 6.6
it is clear that, as test length increases, both $\sigma(\Delta)$ and
$1 - p_o$ decrease, but not very rapidly. In interpreting $\sigma(\Delta)$
it is useful to keep in mind that it can be no larger than 0.25 when each
observed item score takes on one of two possible values, as is the case
for the synthetic data in Table 6.2.

The values of $\sigma(\Delta)$ and $1 - p_o$ reported in Table 6.6 are based
upon synthetic data, but similar results can easily occur with real data.
Furthermore, the values of $\sigma(\Delta)$ and $1 - p_o$ reported in
Table 6.6 would probably be judged rather large in most real contexts. Of
course, these values can be reduced by increasing test length beyond 20 items.
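The effect of test length on $\sigma(\Delta)$ and KR-21 can be projected with a short calculation. Table 6.5 is not reproduced in this section, so the sketch below rests on two assumptions: that error variance for an n'-item test is the available error variance multiplied by n/n', and that KR-21 for the longer test follows the Spearman-Brown formula. Both reproduce the $\sigma(\Delta)$ and KR-21 entries for n = 15 in Table 6.6. The function name is hypothetical, and the projection of $1 - p_o$ is not attempted because it requires the remaining steps of Table 6.5.

```python
import math


def project_test_length(n, n_new, error_variance, kr21):
    """Project sigma(Delta) and KR-21 from an n-item to an n_new-item test."""
    factor = n_new / n
    new_error_variance = error_variance / factor
    new_kr21 = factor * kr21 / (1.0 + (factor - 1.0) * kr21)
    return math.sqrt(new_error_variance), new_kr21


# Start from the values reported for the 10-item synthetic test.
for length in (10, 15, 20):
    sigma_delta, kr21_value = project_test_length(10, length, 0.0130, 0.54)
    print(length, round(sigma_delta, 2), round(kr21_value, 2))
```

Doubling the test length reduces $\sigma(\Delta)$ only by a factor of $\sqrt{2}$, which is consistent with the observation that the statistics decrease with test length, but not very rapidly.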


Note (Table 6.5): All statistics for a test of length n' are identified
with a prime (') to distinguish them from the corresponding statistics for
the available n-item test.

Table 6.6

Illustrative Results for Changes in Test Length
Using the Synthetic Data Example

  n     $\sigma(\Delta)$     KR-21     $1 - p_o$
 10          .11             .54         .28
 15          .09             .64         .26
 20     [row not legible in the source]

In beginning the above discussion of test length, it was pointed
out that data, per se, cannot specify what the test length should be,
but data can help in making an informed, but still subjective, judgment
about test length. In this regard, $\sigma(\Delta)$ and $1 - p_o$ are
helpful; but it must be recognized that these two statistics provide
different types of information, and perhaps not equally useful information
in a particular context. In the extreme, if an investigator were interested
only in minimizing classification errors, then $\sigma(\Delta)$ would provide
irrelevant information; and, conversely, if an investigator were interested
only in measurement error, then $1 - p_o$ would provide irrelevant
information.

The perspective taken above is that in most realistic settings,
both types of error are likely to be of interest; and, therefore,
consideration has been given to both. Only in a specific context can a
judgment be made concerning which statistic is more appropriate in
considerations regarding test length. As discussed below, a similar
argument applies to agreement coefficients.

Other  Considerations 

Throughout this section, squared error loss and threshold loss
statistics have been treated in parallel. If, in a given context, an
investigator has an unambiguous basis for choosing one loss function
over the other, then, of course, statistics involving the other loss
function become irrelevant. However, in many situations, choice of
a loss function may not be a completely unambiguous decision and,
indeed, it may be that neither loss function is ideal. In such
situations, one approach is to examine statistics for both loss functions,
keeping in mind the different assumptions involved. In doing so, there
is some potential for confusion, but a theoretically better approach
would involve complexities far beyond the intended scope of this
handbook.

In this regard, it should be kept in mind that it is not always the
case that a test is used to make a single type of decision. For example,
it could well be that a given test is sometimes used to make mastery/
non-mastery types of decisions assuming threshold loss; and, at other
times, the test is used simply to estimate examinee universe scores
assuming squared error loss. For such a test, both loss functions are
appropriate depending upon the use of the test. Indeed, in choosing
a loss function, the question of importance is not what constitutes
the test, but rather what constitutes the assumptions about the
decisions to be made using the test.

Sometimes a domain-referenced test is used solely for the purpose
of estimating examinee universe scores, without any consideration of
a cutting score. In such situations (assuming that squared error loss
is relevant), $\sigma(\Delta)$ is still appropriate, as is the index $\Phi$
given by Equation 6.5 in Table 6.3. In this sense, $\Phi$ may be viewed as
a general-purpose agreement coefficient, or index of dependability, for a
domain-referenced test. Note that when a domain-referenced test is used
solely to estimate examinee universe scores, threshold loss statistics like
those treated above are meaningless.

In the introduction to this section, reference was made to the
fact that the agreement coefficients discussed above are sometimes
called reliability coefficients. Actually, these agreement coefficients
carry with them a connotation of validity, too, in the sense that they
involve consideration of the universe of items, which is often the
principal "criterion" of interest, or the only criterion available.
Indeed, one perspective on measurement suggests that notions of
reliability and validity can be blended together into a consideration of the
extent to which observed scores are generalizable to universe scores.
This perspective seems especially relevant for domain-referenced
interpretations of test scores. In this sense, this section has considered
issues relevant to both reliability and validity.

Appendix A

Tables

Table A.1 is based on the Fhaner-Wilcox-Huynh procedure referenced
in Appendix B. This table was developed using the IMSL (1979)
subroutine MDBETA.

The results reported in Table A.2 are based on the assumptions of
binomial likelihood and a uniform beta prior (see Appendix B). The
probabilities reported in Table A.2 were obtained using the IMSL (1979)
subroutine MDBETA; and the credibility intervals were obtained using
CADA [Isaacs and Novick, and Jackson (1974)], and some calculus.

Table A.3 was developed using the IMSL (1979) subroutines MDBNOR
and MDNOR.

Table A.1

Advancement Scores for Various Indifference Zones,
Ordered According to Interval Mid-Points,
for Loss Ratios of 1, 2, and 3

[Table body not legible in the source.]

Note. The width of each interval is indicated in parentheses below the
limits of the interval. Entries in the table are advancement scores for
loss ratios of 1, 2, and 3, respectively. For example, a loss ratio of 2
is appropriate if false positive errors are twice as serious as false
negative errors.

Table A.2

[Table title and body not legible in the source.]

Table A.3

Probability that Two Standard Normal Variables,
with Correlation Equal to KR-21,
are Both Less Than or Equal to z

[Table body not legible in the source.]

Appendix B

Technical Notes

These notes are provided for two reasons: (a) to cite appropriate
technical background and references for each section of the handbook;
and (b) to provide a limited amount of technical justification for
equations and/or procedures that are not specifically reported in readily
available references. However, there is no intent to cite all potentially
relevant references or to verify in detail all equations and/or
procedures.

In the body of this handbook, distinctions have been drawn only
very rarely between parameters and estimates of parameters. In these
technical notes such distinctions are made through the use of a "hat"
(^) above unbiased estimates of parameters, which are denoted by Greek
letters. The reader should be careful not to confuse this use of a
"hat" with the use already made of this symbol in the body of the
handbook. Specifically, the "hat" symbol is also used to distinguish
between the sample variances $s^2$ and $\hat{s}^2$, where the former involves
a denominator of n and the latter involves a denominator of n - 1. (Of
course, $\hat{s}^2$ is an unbiased estimate of a parameter, but usually not a
parameter of interest, here.)

Section 1

Berk (1980) provides an edited book of readings on the subject of
domain-referenced (or criterion-referenced) measurements. Most of the
topics treated in this handbook are also covered in Berk (1980). Also,
Hambleton, Swaminathan, Algina, and Coulson (1978) provide a technical
review of many issues treated here; Millman (1979) provides a brief review
written principally for practitioners; and Nitko (1980) reviews the many
varieties of criterion-referenced tests. It should be noted, however,
that there are clear differences between this handbook and the above
references: differences in emphasis and scope, as well as occasional
differences in perspective and approach.

Many introductory measurement textbooks give considerable attention
to defining objectives and tables of specifications. Recently, Ellis
and Wulfeck (1979) and Ellis, Wulfeck, and Fredericks (1979) have
developed a task/content matrix for specific use in Navy training that
involves domain-referenced testing.

Section  2 

Most introductory measurement textbooks provide detailed discussion
of item analysis procedures. Even though such discussions usually
emphasize norm-referenced testing, many of the guidelines typically suggested
are relevant for domain-referenced testing, too, with one noticeable
exception. In the opinion of this author, it is not generally a good
practice in domain-referenced testing to select items in a systematic
manner so as to obtain some pre-specified distribution of item difficulty
levels and/or discrimination indices. More specifically, this is not a
good practice if a test is to be used solely for the purpose of making
domain-referenced interpretations of test scores.

The discrimination index, B, discussed in Section 2 is treated by
Brennan (1972). More recently, Harris and Wilcox (1980) have commented
on this index.

Section 3

The procedure suggested in Section 3 for establishing a cutting score
is a slight modification of a procedure originally proposed by Angoff
(1971); and the developments involving $\sigma(\bar{y})$ are discussed by
Brennan and Lockwood (1980). The specific equations for $\sigma(\bar{y})$ in
Table 3.2 can be derived in the manner outlined below.

Let the probability assigned by rater r (r = 1, 2, ..., t) to item i
(i = 1, 2, ..., m) for a set of m items be:

$$y_{ri} = \mu + \mu^{\sim}_{r} + \mu^{\sim}_{i} + \mu^{\sim}_{ri} ,$$

where $\mu$ is the grand mean and the $\mu^{\sim}$ are score effects as
discussed by Brennan and Lockwood (1980). It can be shown that unbiased
estimates of the variance of these score effects, in terms of the sample
statistics reported in Table 3.2, are:

$$\hat{\sigma}^2(ri) = \Big[\sum_r \hat{s}^2(y_{ri}) - t\,\hat{s}^2(\bar{y}_i)\Big] \Big/ (t - 1) \qquad \text{(B1)}$$

$$\hat{\sigma}^2(r) = \hat{s}^2(\bar{y}_r) - \hat{\sigma}^2(ri)/m \qquad \text{(B2)}$$

$$\hat{\sigma}^2(i) = \hat{s}^2(\bar{y}_i) - \hat{\sigma}^2(ri)/t . \qquad \text{(B3)}$$

For random samples of t raters and random samples of n items (n need
not equal m) an unbiased estimate of $\sigma^2(\bar{y})$ is:

$$\hat{\sigma}^2(\bar{y}) = \frac{\hat{\sigma}^2(r)}{t} + \frac{\hat{\sigma}^2(i)}{n} + \frac{\hat{\sigma}^2(ri)}{nt} . \qquad \text{(B4)}$$

Using Equations B1 to B3 in B4 we obtain

$$\hat{\sigma}^2(\bar{y}) = \frac{\hat{s}^2(\bar{y}_i)}{n} + \frac{\hat{s}^2(\bar{y}_r)}{t} - \left[\frac{\sum_r \hat{s}^2(y_{ri})}{mt(t-1)} - \frac{\hat{s}^2(\bar{y}_i)}{m(t-1)}\right] , \qquad \text{(B5)}$$

where the bracketed term in Equation B5 is $\hat{\sigma}^2(ri)/tm$, which
constitutes the A-term defined in Table 3.2. The square root of Equation B5
is Equation 3.2 in Table 3.2; and when n equals m, the square root of
Equation B5 is Equation 3.1 in Table 3.2.
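Equations B1 to B5 can be checked numerically with a small rater-by-item matrix. The sketch below is illustrative only: the function names and the ratings are hypothetical, and sample variances are computed with a denominator of one less than the number of values, as required for the estimates to be unbiased.

```python
def unbiased_variance(values):
    """Sample variance with a denominator of len(values) - 1."""
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values) / (len(values) - 1)


def cutting_score_error_variance(ratings, n=None):
    """Estimate the variance of the mean rating from a t-by-m matrix.

    ratings[r][i] is the probability assigned by rater r to item i.
    n is the number of items in the test for which the estimate is wanted;
    it defaults to m, the number of items rated.
    """
    t = len(ratings)
    m = len(ratings[0])
    if n is None:
        n = m
    rater_means = [sum(row) / m for row in ratings]
    item_means = [sum(row[i] for row in ratings) / t for i in range(m)]
    var_rater_means = unbiased_variance(rater_means)
    var_item_means = unbiased_variance(item_means)
    within_rater = sum(unbiased_variance(row) for row in ratings)

    var_ri = (within_rater - t * var_item_means) / (t - 1)   # Equation B1
    var_r = var_rater_means - var_ri / m                     # Equation B2
    var_i = var_item_means - var_ri / t                      # Equation B3
    var_mean = var_r / t + var_i / n + var_ri / (n * t)      # Equation B4
    a_term = var_ri / (t * m)
    var_mean_b5 = var_item_means / n + var_rater_means / t - a_term  # B5
    return {"var_ri": var_ri, "var_r": var_r, "var_i": var_i,
            "var_mean": var_mean, "var_mean_b5": var_mean_b5,
            "a_term": a_term}


# Hypothetical ratings from t = 2 raters on m = 3 items.
example = cutting_score_error_variance([[0.6, 0.8, 0.7], [0.6, 1.0, 0.8]])
print({name: round(value, 5) for name, value in example.items()})
```

The two routes to the variance of the mean rating (Equation B4 from the variance components, and Equation B5 from the sample statistics) agree, which is a useful check on the reconstruction of the equations.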

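Finally, as $n \to \infty$, it is evident from Equation B4 that

$$\hat{\sigma}^2(\bar{y}) = \hat{\sigma}^2(r)/t \;;$$

and using Equation B2,

$$\hat{\sigma}^2(\bar{y}) = \frac{s^2(\bar{y}_r)}{t} - \frac{\hat{\sigma}^2(ri)}{mt} = \frac{s^2(\bar{y}_r)}{t} - A \;. \tag{B6}$$

The square root of $\hat{\sigma}^2(\bar{y})$ in Equation B6 is Equation 3.3 in Table 3.2.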
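
As an illustration only, Equations B1 to B4 and B6 can be sketched in Python for a t-by-m matrix of rater probabilities. The function name, the dictionary keys, and the ratings are hypothetical. The sketch assumes that the sample variances use a divisor of (count - 1), which is the choice under which B1 reproduces the rater-by-item mean square; the definitions in Table 3.2 should be checked before relying on it.

```python
from math import sqrt
from statistics import mean, variance


def cutting_score_error(ratings, n=None):
    """Estimate variance components and the error variance of the mean rating.

    ratings[r][i] is the probability rater r assigned to item i
    (t raters by m items). n is the test length used in Equation B4;
    n=None stands for the limiting case n -> infinity (Equation B6).
    """
    t, m = len(ratings), len(ratings[0])
    rater_means = [mean(row) for row in ratings]
    item_means = [mean(row[i] for row in ratings) for i in range(m)]

    # Assumption: sample variances use the divisor (count - 1).
    s2_r = variance(rater_means)                        # variance of rater means
    s2_i = variance(item_means)                         # variance of item means
    sum_s2_ri = sum(variance(row) for row in ratings)   # within-rater variances

    var_ri = (sum_s2_ri - t * s2_i) / (t - 1)   # B1
    var_r = s2_r - var_ri / m                   # B2
    var_i = s2_i - var_ri / t                   # B3

    if n is None:
        var_mean = var_r / t                                  # B6
    else:
        var_mean = var_r / t + var_i / n + var_ri / (n * t)   # B4

    # Estimates can be negative in small samples; clamp before the root.
    return {
        "var_r": var_r,
        "var_i": var_i,
        "var_ri": var_ri,
        "var_mean": var_mean,
        "se_mean": sqrt(max(var_mean, 0.0)),
    }


# Hypothetical ratings: 3 raters, 4 items.
ratings = [
    [0.60, 0.70, 0.80, 0.50],
    [0.50, 0.80, 0.90, 0.60],
    [0.70, 0.60, 0.70, 0.40],
]
print(cutting_score_error(ratings, n=4))   # n equal to m
print(cutting_score_error(ratings))        # n -> infinity
```

The sketch computes B4 directly from the three component estimates rather than from the expanded form in B5; the two are algebraically the same.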

Section 4

Table A.1, which is discussed in Section 4, results from applying
a minimax procedure presented in Huynh (1980, pp. 170-171). As such,
this  procedure  is  basically  an  extension  of  an  approach  suggested 
by  Fhaner  (1974)  and  treated  by  Wilcox  (1976).  It  should  be  noted, 
however,  that  where  Huynh  talks  about  the  loss  ratio  Q,  this  author 
talks  about  1/Q;  e.g.,  if  false  positive  errors  are  twice  as  serious 
as  false  negative  errors,  Huynh  says  the  loss  ratio  is  Q  =  .50,  and  in 
Section  4  this  loss  ratio  is  identified  as  1/.50  =  2.  Of  course,  this 
difference  is  simply  a  question  of  definition. 

It is suggested in Section 4 that a confidence interval for $\pi_o$
from a cutting score study be considered as one possible way to define
an indifference zone. In doing so, it might be argued that one is
implicitly violating the assumption of 0-1 referral loss, which is
an assumption made by Huynh (1980) in his formulation of the minimax
procedure used to generate Table A.1. Another approach that might
be considered is to eliminate the indifference zone and use $\bar{y}$ and
$\sigma(\bar{y})$ from a cutting score study to establish an ogive-shaped referral
success function, but this is considerably more complicated than the
approach taken in this handbook.

Section  5 

With respect to technical issues, Section 5 is based principally on
Table A.2, which was developed under the assumptions of a binomial
likelihood and a uniform beta prior distribution for $\pi$ (sometimes called
a non-informative prior).


Specifically, an entry in the left-hand part of Table A.2 is:

$$\text{Prob}(\pi_p \geq \pi_o \mid n, x_p) = 1 - I_{\pi_o}(x_p + 1,\; n - x_p + 1) \;;$$

where $x_p$ and $\pi_p$ are an examinee's observed and universe scores, respectively;
and $I_{\pi_o}(x_p + 1,\; n - x_p + 1)$ is the incomplete beta function with
parameters $x_p + 1$ and $n - x_p + 1$. An entry in the right-hand part of
Table A.2 is a Bayesian credibility interval for $\pi$ under the assumption
of a uniform beta prior distribution. Technically, these intervals are
called highest density regions. (Some might quarrel with calling an
interval a highest density region when $n = x_p$.) Readers unfamiliar with
these Bayesian concepts can consult Novick and Jackson (1974, Chapter 5).
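
A minimal sketch of the left-hand entries follows, assuming that the observed score is the number of items correct out of n and that the prior is uniform, so that the posterior is Beta(x + 1, n - x + 1). Because both beta parameters are then integers, the incomplete beta function reduces to a binomial tail sum and no special-function library is needed. The function name and the example values are hypothetical; the sketch is not the procedure used to generate Table A.2.

```python
from math import comb


def prob_universe_score_at_least(pi_o, n, x):
    """Posterior Prob(pi_p >= pi_o | n, x) under a uniform prior.

    The posterior is Beta(x + 1, n - x + 1). For integer parameters,
    1 - I_{pi_o}(x + 1, n - x + 1) equals the probability of at most x
    successes in n + 1 binomial trials with success probability pi_o.
    """
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    if not 0.0 <= pi_o <= 1.0:
        raise ValueError("pi_o must lie between 0 and 1")
    trials = n + 1
    return sum(
        comb(trials, j) * pi_o ** j * (1.0 - pi_o) ** (trials - j)
        for j in range(x + 1)
    )


# Hypothetical case: 8 of 10 items correct, reference value 0.80.
print(prob_universe_score_at_least(0.80, n=10, x=8))
```

With no data at all (n = 0, x = 0) the function returns 1 - pi_o, which is what a uniform prior implies.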

A principal reason for using a beta prior here is that this assumption
results in a Bayesian credibility interval, which enables one to
make probability statements about the parameter $\pi$. By contrast, a
confidence interval allows one to make probability statements about
intervals covering $\pi$. Some might argue that in specific contexts, a
uniform beta prior is frequently unrealistic because a decision-maker
may know a great deal about an examinee. However, to assess "informative"
(i.e., non-uniform) beta priors in a decision-making process virtually
necessitates an interactive computing system such as CADA (Isaacs and
Novick, 1978). Furthermore, a decision-maker would need to justify the
specific "informative" prior chosen in each and every individual case.

It should be noted that the Fhaner-Wilcox-Huynh approach to establishing
an advancement score, discussed in Section 4, involves consideration
of false positive and false negative errors, but a uniform beta
prior is not assumed in their approach. There is, therefore, a degree of
discontinuity between Sections 4 and 5. (For the purpose of establishing
an advancement score, a uniform beta prior assumption for a group
of examinees seems highly unrealistic to this author. One might argue
that an informative beta prior could be used, but, as indicated previously,
the process of doing so is far from trivial and clearly beyond the scope
of this handbook.)

Section  6 

The  theoretical  framework  used  in  Section  6  for  integrating  squared 
error  loss  and  threshold  loss  approaches  is  provided  by  Kane  and  Brennan 
(1980). In addition, a considerable number of papers have been published
that  involve  consideration  of  one  loss  function  or  the  other. 

Concerning threshold loss, the following publications, among others,
are relevant: (a) Hambleton and Novick (1973) provided the first integrated
treatment of threshold loss and domain-referenced testing issues;
(b) Swaminathan, Hambleton, and Algina (1974) suggested using coefficient
Kappa; (c) Huynh (1976) and Subkoviak (1976) provided procedures for
estimating threshold loss coefficients based on a single test; and (d)
Subkoviak (1980) has reviewed much of the work in this area.

Concerning squared error loss, the following publications, among
others, are relevant: (a) using classical test theory assumptions,
Livingston (1972) proposed a reliability-like coefficient for
domain-referenced tests; (b) using generalizability theory, Brennan and Kane
(1977a, b) proposed two coefficients and a definition of error variance;
(c) Brennan (1979a) has provided a computer program for performing
computations involving squared error loss considerations with
domain-referenced testing; and (d) Brennan (1980b) has reviewed much of the work
in this area.

The  formulas  in  Table  6.3  are  computationally  easy  to  use,  but  they 
are rather unusual expressions for estimates of their respective
parameters. For this reason, the derivations of these expressions are
briefly  outlined  below. 

Let the observed score for person p (p = 1, 2, ..., k) on item i
(i = 1, 2, ..., n) be:

$$x_{pi} = \mu + \pi_p^{\sim} + \beta_i^{\sim} + \pi\beta_{pi}^{\sim}$$

where $\mu$ is the grand mean in the population of persons and universe of
items; $\pi_p^{\sim}$ is the score effect for person p ($\pi_p = \mu + \pi_p^{\sim}$); $\beta_i^{\sim}$ is the
score effect for item i; and $\pi\beta_{pi}^{\sim}$ is the effect for the interaction
of person p and item i, which is confounded with experimental error.
(See Brennan and Kane, 1977a, for more detail.)

It is well-known that an unbiased estimate of $\sigma^2(\pi)$ is:

$$\hat{\sigma}^2(\pi) = [MS(p) - MS(pi)]/n \tag{B7}$$

where "MS" is "mean square"; and, it is relatively easy to show that,
for dichotomous data, Equation B7 can be expressed as Equation 6.1 in
Table 6.3. In a similar manner, it can be shown that


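$$\hat{\sigma}^2(\beta) = \frac{n\,[k\, s^2(\bar{x}_i) + s^2(\bar{x}_p) - \bar{x}(1-\bar{x})]}{(n-1)(k-1)} \tag{B8}$$

and

$$\hat{\sigma}^2(\pi\beta) = \frac{n\,k\,[\bar{x}(1-\bar{x}) - s^2(\bar{x}_p) - s^2(\bar{x}_i)]}{(n-1)(k-1)} \;. \tag{B9}$$

Now,

$$\hat{\sigma}^2(\Delta) = [\hat{\sigma}^2(\beta) + \hat{\sigma}^2(\pi\beta)]/n \tag{B10}$$

and replacement of Equations B8 and B9 in B10 gives (after simplifying
terms) Equation 6.2 in Table 6.3.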
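
The following Python sketch shows one way Equations B7 to B10 might be evaluated for a k-by-n matrix of 0/1 item scores. It assumes that the variances of the person means and item means use divisors of k and n respectively, which is the choice under which B8 and B9 agree with the usual mean-square estimates, and it divides by the number of items n in B7. The function name, the dictionary keys, and the data are hypothetical.

```python
from statistics import mean, pvariance


def variance_components(scores):
    """Variance component estimates for dichotomous person-by-item data.

    scores[p][i] is 1 if person p answered item i correctly, else 0
    (k persons by n items).
    """
    k, n = len(scores), len(scores[0])
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[i] for row in scores) for i in range(n)]
    xbar = mean(person_means)            # grand mean proportion correct

    # Assumption: variances of the means use divisors k and n.
    s2_p = pvariance(person_means)
    s2_i = pvariance(item_means)
    pq = xbar * (1 - xbar)
    denom = (n - 1) * (k - 1)

    var_beta = n * (k * s2_i + s2_p - pq) / denom      # B8
    var_pibeta = n * k * (pq - s2_p - s2_i) / denom    # B9, equal to MS(pi)
    ms_p = n * k * s2_p / (k - 1)                      # MS(p)
    var_pi = (ms_p - var_pibeta) / n                   # B7
    var_delta = (var_beta + var_pibeta) / n            # B10
    return {
        "var_pi": var_pi,
        "var_beta": var_beta,
        "var_pibeta": var_pibeta,
        "var_delta": var_delta,
    }


# Hypothetical data: 5 persons, 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]
print(variance_components(scores))
```

After simplification, the value returned for "var_delta" equals the grand-mean term minus the person-mean variance, divided by n - 1, which is the quantity that appears in the numerator of Equation B12.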

Brennan and Kane (1977a) report that a consistent estimate of
$\Phi(c_o)$ is:

$$\hat{\Phi}(c_o) = 1 - \frac{1}{n-1}\left[\frac{\bar{x}(1-\bar{x}) - s^2(\bar{x}_p)}{(\bar{x}-c_o)^2 + s^2(\bar{x}_p)}\right] \tag{B11}$$

$$= 1 - \left\{\frac{[\bar{x}(1-\bar{x}) - s^2(\bar{x}_p)]/(n-1)}{(\bar{x}-c_o)^2 + s^2(\bar{x}_p)}\right\} . \tag{B12}$$

The numerator of the term in braces is simply $\hat{\sigma}^2(\Delta)$ given by Equation 6.2
in Table 6.3; consequently, Equation B11 can be expressed as Equation 6.3
in Table 6.3. [Technically, $\sigma^2(\Delta)$ in Equation 6.3 should be $\hat{\sigma}^2(\Delta)$; but,
as previously stated, notational distinctions between parameters and
estimates are not made in the body of this handbook.] Equation 6.4 follows
from the fact that $\hat{\Phi}(c_o)$ equals KR-21 if $c_o = \bar{x}$ (see Brennan, 1977). The
expression for KR-21 in Equation 6.3 may appear strange because it involves
$\hat{\sigma}^2(\Delta)$, but it is easily verified that this expression is algebraically
identical to the well-known expression for KR-21.
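
The equivalence with KR-21 can be checked numerically. The sketch below assumes that person scores are expressed as proportions correct, that their variance uses a divisor of k, and that KR-21 is written in the same proportion metric. The function names and the data are hypothetical.

```python
from statistics import mean, pvariance


def phi_hat(person_means, n, c_o):
    """Equation B12: index of dependability for cutting score c_o."""
    xbar = mean(person_means)
    s2_p = pvariance(person_means)                 # divisor k (assumed)
    var_delta = (xbar * (1 - xbar) - s2_p) / (n - 1)
    # Undefined when all person means are equal and c_o equals xbar.
    return 1 - var_delta / ((xbar - c_o) ** 2 + s2_p)


def kr21(person_means, n):
    """KR-21 in the proportion-correct metric."""
    xbar = mean(person_means)
    s2_p = pvariance(person_means)
    return (n / (n - 1)) * (1 - xbar * (1 - xbar) / (n * s2_p))


# Hypothetical proportions correct for 5 examinees on a 4-item test.
person_means = [0.75, 0.50, 0.50, 0.25, 1.00]
n_items = 4
print(phi_hat(person_means, n_items, c_o=mean(person_means)))
print(kr21(person_means, n_items))
print(phi_hat(person_means, n_items, c_o=0.80))
```

The first two printed values agree, and the third is larger because moving the cutting score away from the mean increases the denominator in Equation B12.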

The steps provided in Table 6.5 for obtaining estimates of threshold
loss coefficients of agreement are based on Huynh's (1976) normal
approximation procedure (see, also, Subkoviak, 1980), without using
an arcsine transformation (see Peng & Subkoviak, in press). In Table 6.5
reference is made to using the "closest" value in Table A.3; alternatively,
one can obtain better estimates using linear interpolation (see Huynh,
1978; different context, but same process). Huynh (1978) provides a
computer program for estimating threshold loss coefficients, as well
as tables of estimates of $p_o$, Kappa, and their standard errors for
test lengths of 5 to 10 items (see, also, Huynh & Saunders, 1980).

Since the procedure outlined in Table 6.4 is based on a normal
approximation, estimates obtained using this procedure may be somewhat biased.
However,  the  degree  of  bias  is  likely  to  be  small  unless  n  is  quite 
small  and/or  c  is  quite  close  to  one. 

In Table 6.5, Equation 6.6 is simply $[\hat{\sigma}^2(\beta) + \hat{\sigma}^2(\pi\beta)]/n'$; and the
remaining equations and steps constitute a somewhat ad hoc approach for
using Huynh's normal approximation procedure to estimate the proportion
of inconsistent decisions for a test of length n'.
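
As a small illustration of Equation 6.6 only, the error variance for a test of a different length can be projected from the two component estimates. The function name and the numerical values are hypothetical, and the remaining steps of Table 6.5 are not reproduced here.

```python
def error_variance_for_length(var_beta, var_pibeta, n_prime):
    """Equation 6.6: error variance for a test of n_prime items."""
    if n_prime < 1:
        raise ValueError("n_prime must be at least 1")
    return (var_beta + var_pibeta) / n_prime


# Hypothetical component estimates from an existing test.
for length in (10, 20, 40):
    print(length, error_variance_for_length(0.02, 0.18, length))
```

Doubling the test length halves the estimated error variance, since the component estimates themselves do not depend on n'.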

Brennan and Kane (1977b) show that $\hat{\sigma}^2(\Delta)$ is algebraically equal to
the average of the squared values of $\hat{\sigma}(\Delta_p)$ in Table 5.4. Note also that
$\hat{\sigma}(\Delta_p)$ is identical to Lord's (1957) formula for the standard error of
measurement of an examinee's mean score.
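
The algebra behind this result is short. The sketch below assumes that $\hat{\sigma}(\Delta_p)$ takes the form of Lord's formula in the mean-score metric, $\sqrt{\bar{x}_p(1-\bar{x}_p)/(n-1)}$, where $\bar{x}_p$ is the proportion of the n items answered correctly by examinee p, and that $s^2(\bar{x}_p)$ uses a divisor of k; Table 5.4 should be consulted for the form actually used in the handbook.

$$\frac{1}{k}\sum_p \hat{\sigma}^2(\Delta_p) = \frac{1}{k}\sum_p \frac{\bar{x}_p(1-\bar{x}_p)}{n-1} = \frac{1}{n-1}\Big[\bar{x} - \frac{1}{k}\sum_p \bar{x}_p^2\Big] = \frac{1}{n-1}\big[\bar{x} - s^2(\bar{x}_p) - \bar{x}^2\big] = \frac{\bar{x}(1-\bar{x}) - s^2(\bar{x}_p)}{n-1} \;.$$

The third step uses $\frac{1}{k}\sum_p \bar{x}_p^2 = s^2(\bar{x}_p) + \bar{x}^2$, and the final expression is the numerator of the term in braces in Equation B12.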